(21-6-03) Intelligent Text Summarization System Based on the Gemma 2B Model: Experiments (3)

码农三叔

Last modified: 2024-05-26 10:36:18

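Columns: NLP Algorithms in Practice (《NLP算法实战》), Large Models from Beginner to Practice (大模型从入门到实战). Tags: deep learning, artificial intelligence, python, natural language processing, language models, transformer, langchain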

First published: 2024-05-26 10:35:58

Article link: https://blog.csdn.net/asd343442/article/details/139211830

9.6.6 Strategy Selection Functions

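(1) So far, several strategies for generating document summaries have been implemented. To make it easy to choose among them, the code below defines a set of functions that summarize a document according to the selected strategy.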

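# Imports needed by the functions below. These paths match LangChain releases
# from around the time of writing; adjust them to your installed version.
from langchain.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import HTMLHeaderTextSplitter, CharacterTextSplitter

def check_inputs(summarization_strategy, chunking_strategy):
    # Reject unknown strategy names early; chunking_strategy is not checked for stuffing
    if summarization_strategy not in ['stuffing', 'map_reduce', 'refine']:
        raise ValueError(f'Wrong parameter "summarization_strategy": select either "stuffing", "map_reduce" or "refine". "{summarization_strategy}" was chosen instead.')
    if chunking_strategy not in ['html', 'character', 'html_character'] and summarization_strategy!='stuffing':
        raise ValueError(f'Wrong parameter "chunking_strategy": select either "html", "character" or "html_character". "{chunking_strategy}" was chosen instead.')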
    
def prepare_prompt(langchain_pipeline, writeup, verbose):
    # Wrap the writeup in a single user message and let the tokenizer's chat template format it
    if verbose:
        print('> Applying chat template to the prompt')
    messages = [
        {
            "role": "user",
            "content": "Summarize the following text in a technical way. Focus on facts, numbers and strategies used. Divide the summary in chapters, be impersonal and use bullet points:\n\n{}".format(writeup)
        }
    ]

    prompt = langchain_pipeline.hf_pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    return prompt

def mapreduce_strategy(langchain_pipeline, chunks):
    prompt_template = """<bos><start_of_turn>user
    Summarize the following text in a technical way. Focus on facts, numbers and strategies used. Divide the summary in chapters, be impersonal and use bullet points:

    {text}<end_of_turn>
    <start_of_turn>model"""
    prompt_init = PromptTemplate.from_template(prompt_template)


    combine_template = """<bos><start_of_turn>user
    You are given a text containing summaries of different parts of a document.
    Create one single summary combining all the information of the chapters. Divide the summary in chapters, be impersonal and use bullet points:

    {text}<end_of_turn>
    <start_of_turn>model"""
    combine_prompt = PromptTemplate.from_template(combine_template)

    chain = load_summarize_chain(langchain_pipeline,
                                 chain_type='map_reduce', 
                                 map_prompt=prompt_init, 
                                 combine_prompt=combine_prompt)

    out_summary = chain.invoke(chunks)
    
    return out_summary

def refine_strategy(langchain_pipeline, chunks):
    prompt_template = """<bos><start_of_turn>user
    Summarize the following text in a technical way. Focus on facts, numbers and strategies used. Divide the summary in chapters, be impersonal and use bullet points:

    {text}<end_of_turn>
    <start_of_turn>model"""
    prompt_init = PromptTemplate.from_template(prompt_template)

    refine_template = """<bos><start_of_turn>user
    Your job is to produce a final document divided in chapters and bullet points.
    You are given a text containing an existing summary to a certain point:

    {existing_answer}

    You can now refine it (if necessary) with more context below.

    {text}

    Given the new context, refine the original summary.<end_of_turn>
    <start_of_turn>model"""
    prompt_refine = PromptTemplate.from_template(refine_template)

    chain = load_summarize_chain(langchain_pipeline, chain_type='refine',
                                 return_intermediate_steps=True,
                                 input_key='input_documents',
                                 output_key='output_text',
                                 question_prompt=prompt_init,
                                 refine_prompt=prompt_refine)

    out_summary = chain.invoke(chunks, return_only_outputs=True)
    return out_summary
    
    
def prepare_chunks(text_to_split, chunking_strategy, verbose):
    if verbose:
        print(f'> Preparing text chunking. Strategy: {chunking_strategy}')
    output_chunks = text_to_split # To avoid local variable referenced before assignment
    
    if chunking_strategy in ('html', 'html_character'):
        if verbose:
            print('> Splitting at HTML level')
        # Split on HTML headers
        headers_to_split_on = [
            ("h1", "Header 1"),
            ("h2", "Header 2")
        ]
        text_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on, return_each_element=False)
        output_chunks = text_splitter.split_text(text_to_split)
        
        if verbose:
            print(f"""
        Length writup: {len(text_to_split)}
        Number of html splits: {len(output_chunks)}
        Length of each split: {[len(i.page_content) for i in output_chunks]}
        """)

        for doc in output_chunks:
            # Join the metadata and the content together
            final_content = '\n'.join(doc.metadata.values()) + '\n' + doc.page_content
            # Replace the old content with the enriched one
            doc.page_content = final_content

    if chunking_strategy in ('character', 'html_character'):
        if verbose:
            print('> Splitting at character level (newline char)')
        text_splitter = CharacterTextSplitter(separator='\n', chunk_size=2000, chunk_overlap=100)
        if chunking_strategy == 'character':
            # split_text doesn't automatically create Documents object in LangChain if we only want character splitting (and no html)
            docs = text_splitter.create_documents([output_chunks])
            output_chunks = text_splitter.split_documents(docs)
        else:
            # otherwise, htmlsplitter does create documents that we can further split 
            output_chunks = text_splitter.split_documents(output_chunks)
        
        if verbose:
            print(f"""
        Number of final splits: {len(output_chunks)}
        Length of each final split: {[len(i.page_content) for i in output_chunks]}
        """)
    
    return output_chunks
    
def summarize(langchain_pipeline, writeup, summarization_strategy='stuffing', chunking_strategy='html', verbose=True):
    # Entry point: validate the arguments, then dispatch to the selected strategy
    check_inputs(summarization_strategy, chunking_strategy)
    if verbose:
        print('> Begin summarization')
    if summarization_strategy == 'stuffing':
        if verbose:
            print('> Summarization strategy: Stuffing. Ignoring chunking_strategy')
        prompt = prepare_prompt(langchain_pipeline=langchain_pipeline, writeup=writeup, verbose=verbose)
        if verbose:
            print('> Invoking chain...')
        output = langchain_pipeline.invoke(prompt)
    elif summarization_strategy == 'map_reduce':
        if verbose:
            print('> Summarization strategy: MapReduce')
        chunks = prepare_chunks(writeup, chunking_strategy, verbose)
        if verbose:
            print('> Invoking chain...')
        output = mapreduce_strategy(langchain_pipeline=langchain_pipeline, chunks=chunks)['output_text'].replace('\n\n','\n')
    else:
        if verbose:
            print('> Summarization strategy: Refine')
        chunks = prepare_chunks(writeup, chunking_strategy, verbose)
        if verbose:
            print('> Invoking chain...')
        output = refine_strategy(langchain_pipeline=langchain_pipeline, chunks=chunks)['output_text']

    if verbose:
        print('\n########## SUMMARY ##########\n')
    return output

The functions above are described below:

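check_inputs: Checks whether the summarization_strategy and chunking_strategy values supplied by the user are valid. The chunking strategy is not validated when the Stuffing strategy is selected, because it is not used in that case.
prepare_prompt: Builds the prompt for the given document writeup by wrapping it in a user message and applying the tokenizer's chat template, so that it can be passed to the model for summarization.
mapreduce_strategy: Implements the MapReduce summarization strategy. It receives the document chunks, summarizes each chunk separately, and then merges these summaries into a single summary.
refine_strategy: Implements the Refine summarization strategy. It generates an initial summary from the first chunk, then refines that summary step by step as each subsequent chunk is supplied as new context.
prepare_chunks: Prepares the document chunks according to chunking_strategy, splitting on HTML headers (h1 and h2), on newlines into chunks of up to 2,000 characters with a 100-character overlap, or both in sequence.
summarize: The main function and entry point of the whole workflow. It first validates the input parameters and then calls the function that matches the selected strategy. With the Stuffing strategy, the summary is generated directly and the chunking strategy is ignored. With the MapReduce or Refine strategy, the document chunks are prepared first and then passed to the corresponding strategy function.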
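The control flow of the three strategies described above can be sketched without LangChain. This is only an illustration under stated assumptions: CountingLLM is a hypothetical stand-in for the model, the prompts are shortened placeholders, and the map_reduce function ignores any intermediate collapsing that LangChain may perform when the combined summaries are too long.

```python
from typing import List


class CountingLLM:
    """Hypothetical stand-in for the model: counts calls and returns a tag."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"summary#{self.calls}"


def stuffing(llm, text: str) -> str:
    # One call: the whole document goes into a single prompt.
    return llm(f"Summarize:\n{text}")


def map_reduce(llm, chunks: List[str]) -> str:
    # Map: summarize every chunk independently.
    partial = [llm(f"Summarize:\n{chunk}") for chunk in chunks]
    # Reduce: merge the partial summaries with one more call.
    return llm("Combine these summaries:\n" + "\n".join(partial))


def refine(llm, chunks: List[str]) -> str:
    # Start from a summary of the first chunk only.
    summary = llm(f"Summarize:\n{chunks[0]}")
    # Feed each remaining chunk as new context for the running summary.
    for chunk in chunks[1:]:
        summary = llm(f"Existing summary:\n{summary}\nNew context:\n{chunk}\nRefine:")
    return summary


chunks = ["part one", "part two", "part three"]
for name, run in [
    ("stuffing", lambda m: stuffing(m, "\n".join(chunks))),
    ("map_reduce", lambda m: map_reduce(m, chunks)),
    ("refine", lambda m: refine(m, chunks)),
]:
    model = CountingLLM()
    run(model)
    print(name, model.calls)  # stuffing 1, map_reduce 4, refine 3
```

For n chunks, this sketch makes one model call for stuffing, n + 1 for MapReduce, and n for Refine. The Refine calls are sequential, since each one depends on the previous summary, while the map calls are independent of each other.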

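Together, these functions form a summarization workflow that lets users choose a summarization strategy (Stuffing, MapReduce, or Refine) and a chunking strategy (HTML-based, character-based, or both combined) according to their needs. In addition, the verbose parameter controls whether progress details are printed during summarization, which is useful for debugging and for tracing the process.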
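To make the character-level chunking more concrete, the following is a simplified approximation of newline-based splitting with a size limit and overlap. It is not LangChain's CharacterTextSplitter implementation; split_on_newlines is a hypothetical helper written only for illustration, and its edge-case behaviour may differ from the library's.

```python
from typing import List


def split_on_newlines(text: str, chunk_size: int = 2000, chunk_overlap: int = 100) -> List[str]:
    """Greedy sketch: pack newline-separated pieces into chunks with overlap.

    A single piece longer than chunk_size is kept whole, and the carried-over
    overlap plus the next piece may exceed chunk_size in this simplified version.
    """
    pieces = [p for p in text.split("\n") if p]
    chunks: List[str] = []
    current: List[str] = []
    for piece in pieces:
        candidate = "\n".join(current + [piece])
        if current and len(candidate) > chunk_size:
            # Close the current chunk.
            chunks.append("\n".join(current))
            # Carry over trailing pieces whose total length fits in chunk_overlap.
            overlap: List[str] = []
            total = 0
            for prev in reversed(current):
                if total + len(prev) > chunk_overlap:
                    break
                overlap.insert(0, prev)
                total += len(prev)
            current = overlap
        current.append(piece)
    if current:
        chunks.append("\n".join(current))
    return chunks


demo = split_on_newlines("aaaa\nbbbb\ncccc\ndddd", chunk_size=10, chunk_overlap=4)
print(demo)  # ['aaaa\nbbbb', 'bbbb\ncccc', 'cccc\ndddd']
```

The overlap repeats the tail of one chunk at the start of the next, so that a sentence or list item cut at a chunk boundary is still seen with some of its surrounding context.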

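(2) From the writeups dataset, select the row at position 20 and the column at position 9 as custom_writeup (positions are zero-based in Python, so position 9 is the 10th column and position 20 is the 21st row), then print the first 500 characters of this document.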

custom_writeup = writeups.iloc[20, 9]
print(custom_writeup[:500])

Running this prints:

<p>Team members are junseonglee11 (@junseonglee11), Ayaan Jang(@ayaanjang). <br>
We ensembled 6 LSTM models (2 different versions).<br>
We modified Robin Smith's and Robert Hatch's notebooks.  </p>
<h1><strong>Our notebooks:</strong></h1>
<p>Inference: <a href="https://www.kaggle.com/code/ayaanjang/20th-tensorflow-lstm-model-inference-merged" target="_blank">https://www.kaggle.com/code/ayaanjang/20th-tensorflow-lstm-model-inference-merged</a><br>
Train: <a href="https://www.kaggle.com/code/junse

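This code shows a small portion of the selected document, which makes it possible to check that the content was loaded correctly or to skim its opening. This is useful when working with a large document or dataset and you want to inspect one specific document.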

(3) Call the previously defined function summarize with the Stuffing strategy to generate a summary of custom_writeup. The code is shown below.

# display and Markdown come from IPython, which is available in Jupyter
from IPython.display import display, Markdown

output = summarize(
    langchain_pipeline=langchain_hf, # Select the LangChain wrapper around the HF pipeline we init before
    writeup=custom_writeup, # Pass the writeup
    summarization_strategy='stuffing', 
    chunking_strategy='html' # Select the chunking strategy. If stuffing, this parameter is ignored
)
display(Markdown(output.replace('#', '')))

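With the code above, the summary can be displayed in formatted form in Jupyter Notebook or any other environment that renders Markdown, which is particularly suitable when a text summary needs to be presented in a structured, readable way. Running it prints: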

> Begin summarization
> Summarization strategy: Stuffing. Ignoring chunking_strategy
> Applying chat template to the prompt
> Invoking chain...

########## SUMMARY ##########

Summary

Chapter 1: Introduction

Team members used 6 LSTM models for an RNN model.
They modified existing notebooks and created new ones for data preprocessing and training.

Chapter 2: Data and Preprocessing

They preprocessed 96 time-series features from the "icecube" dataset.
They experimented with different feature engineering techniques to improve prediction accuracy.
They converted features to TFRecord format for efficient training.

Chapter 3: Model Training and Inference

They trained two versions of the model (with and without feature square root).
They used Adam optimizer and ensemble techniques to combine models.
They split the dataset into 10 folds for training and validation.

Chapter 4: Data Postprocessing

They modified the code to perform weighted average of predicted probabilities and calculate the azimuth and zenith directions of the particle.
They experimented with different post-processing techniques to improve the model's performance.

Chapter 5: Results and Discussion

They presented the results of training and post-processing.
They discussed the importance of feature engineering and ensemble techniques for improving the model's accuracy.
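The call output.replace('#', '') used before rendering removes every # character from the model output, so Markdown headings are rendered as ordinary text. The sketch below shows this effect on a made-up sample string (not real model output) and, as an optional alternative that the original code does not use, a regular expression that strips only the leading heading markers.

```python
import re

# Hypothetical sample text, used only to illustrate the string operations.
sample = "## Chapter 1: Introduction\n* Team members used 6 LSTM models.\n### Notes on model #2"

# What the article's code does: drop every '#', wherever it appears.
cleaned = sample.replace('#', '')
print(repr(cleaned))

# Alternative sketch: remove only heading markers at the start of a line,
# leaving any '#' inside the sentence untouched.
headings_only = re.sub(r'^#+[ \t]*', '', sample, flags=re.MULTILINE)
print(repr(headings_only))
```

The plain replace is simpler but also deletes any # that is part of the text itself, such as "#2" in the sample.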

(4) Call the previously defined function summarize again, this time with the Refine strategy, to generate a summary of custom_writeup. The code is shown below.

output = summarize(
    langchain_pipeline=langchain_hf, 
    writeup=custom_writeup, 
    summarization_strategy='refine', 
    chunking_strategy='html')

display(Markdown(output.replace('#', '')))

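Overall, the purpose of the code above is to generate a summary of custom_writeup automatically using the strategy the user selects, format the result as Markdown, and display it in Jupyter Notebook. Running it prints: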

> Begin summarization
> Summarization strategy: Refine
> Preparing text chunking. Strategy: html
> Splitting at HTML level

        Length writup: 7083
        Number of html splits: 6
        Length of each split: [179, 282, 586, 1818, 848, 1000]
        
> Invoking chain...

########## SUMMARY ##########

Chapter 1: Introduction

LSTM Models

LSTM models are a powerful type of recurrent neural network (RNN) known for their ability to process sequential data effectively. They possess a unique structure with feedback loops that enable them to capture long-term dependencies between consecutive data points.

Chapter 2: Ensemble Learning

Model Selection and Configuration

We select two distinct versions of LSTM models:

Robin Smith's notebook
Robert Hatch's notebook

We optimize these notebooks by adjusting hyperparameters and experimenting with different configurations.

Chapter 3: Experimentation and Evaluation

We conduct a comprehensive set of experiments to identify the optimal settings for the ensemble. We employ various metrics to evaluate the performance of the ensemble, including accuracy, precision, and recall.

Chapter 4: Results and Discussion

The results demonstrate that the ensemble of 2 LSTM models achieves a significant improvement in performance compared to the individual models. We analyze the insights gained from the analysis to understand the strengths and weaknesses of each model.

Chapter 5: Conclusion

In this project, we explored ensemble learning for LSTM models, showcasing its effectiveness in enhancing performance. The results provide valuable insights into the power of combining multiple models for improved accuracy and robustness.
