(21-6-01) Intelligent Text Summarization System Based on the Gemma 2B Model: Experiments (1)

码农三叔

Last modified: 2024-05-25 21:21:26

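Columns: 《NLP算法实战》 (NLP Algorithms in Practice), 大模型从入门到实战 (Large Models: From Beginner to Practice)
Tags: python, deep learning, langchain, artificial intelligence, language models, NLP, Gemma 2B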

First published: 2024-05-25 20:32:01

Article link: https://blog.csdn.net/asd343442/article/details/139203017

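9.6 Experiments

Here, "experiments" means running documents through the configured text summarization pipeline to verify and compare the effect of different summarization techniques and strategies. This section walks through the process of handling documents with that pipeline.

9.6.1 Goals and Steps of the Experiments

The experiments in this project have the following goals.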

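Evaluate summary quality: determine whether the generated summary accurately captures the core content and key points of the original document.
Compare methods: for example, compare the Stuffing, MapReduce, and Refine summarization strategies.
Tune parameters: try different parameter settings to find the best configuration for summary generation.
Establish best practices: learn which summarization method or parameter setting works best for a given document type or requirement.
Test performance: measure the processing speed and resource consumption of the pipeline when it handles large amounts of data.
Assess generalization: test the model on different document types and domains to evaluate how well it generalizes.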
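
The first goal, evaluating summary quality, can be made measurable with an automatic overlap score. The snippet below is a minimal sketch of a ROUGE-1-style unigram overlap in plain Python. It is not part of the original project: it assumes a human-written reference summary is available, and the function names `tokenize` and `unigram_overlap` as well as the sample strings are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_overlap(reference: str, summary: str) -> dict[str, float]:
    """Return ROUGE-1-style precision, recall and F1 from clipped unigram counts."""
    ref_counts = Counter(tokenize(reference))
    sum_counts = Counter(tokenize(summary))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & sum_counts).values())
    precision = overlap / max(sum(sum_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative strings only; they are not taken from the experiment's data.
reference = "The model reached a CV score of 0.898 using EfficientNet-B0."
candidate = "EfficientNet-B0 reached a CV score of 0.898."
scores = unigram_overlap(reference, candidate)
print(scores)
```

A score like this only measures word overlap, so it is best read alongside a manual check that the facts and numbers in the summary are correct.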

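In large-model applications, the experiments usually proceed as follows.

(1) Wrap the Hugging Face pipeline with LangChain to integrate and control text generation more easily.

(2) Apply the MapReduce strategy to the split document chunks: generate a summary for each chunk separately, then merge those summaries.

(3) Apply the Refine strategy to refine the generated summary, improving its quality and readability.

(4) Test different document splitting strategies, such as character-based splitting and splitting on HTML headings, to determine which one best suits your document type.

These experiments give a deeper understanding of the summarization process and help identify the method that best fits your needs. The results can also guide improvements to the pipeline so that it produces more accurate and more useful summaries.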
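
To make the difference between the two splitting strategies in step (4) concrete, here is a small standard-library sketch. It is an illustration only, not the splitter used in the project: the helpers `split_by_chars` and `split_by_h2` and the sample HTML are hypothetical, and the heading split handles only `<h2>` tags.

```python
import re


def split_by_chars(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Cut text into fixed-size character windows, optionally overlapping."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def split_by_h2(html: str) -> list[str]:
    """Split an HTML string so that each chunk starts at an <h2> heading."""
    # The zero-width lookahead keeps the heading tag at the start of each chunk.
    parts = re.split(r"(?=<h2\b)", html)
    return [part.strip() for part in parts if part.strip()]


# Hypothetical sample document with two sections.
html = (
    "<h2>TLDR</h2><p>Short overview.</p>"
    "<h2>Training</h2><p>Details about training.</p>"
)
char_chunks = split_by_chars(html, chunk_size=40, overlap=10)
h2_chunks = split_by_h2(html)
print(len(char_chunks), len(h2_chunks))
```

Character windows ignore the document structure and may cut a section in half, whereas the heading-based split keeps each section together at the cost of uneven chunk sizes.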

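9.6.2 Processing Documents with the Summarization Pipeline

(1) Wrap a preconfigured Hugging Face pipeline in a custom class, GemmaLLM, which inherits from LangChain's LLM (large language model) base class.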

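import gc
from typing import Any

import torch
# Import path assumes a recent LangChain release that ships langchain_core
from langchain_core.language_models.llms import LLM

# Release cached GPU memory and run the garbage collector before building the wrapper.
# no_grad() only disables gradient tracking; it does not change how the cache is cleared.
with torch.no_grad():
    torch.cuda.empty_cache()
gc.collect()

class GemmaLLM(LLM):
    """LangChain LLM wrapper around a Hugging Face text-generation pipeline."""
    hf_pipe: Any = None
    pipe_kwargs: Any = None

    def __init__(self, hf_pipeline, pipe_kwargs):
        super().__init__()
        self.hf_pipe = hf_pipeline
        self.pipe_kwargs = pipe_kwargs

    @property
    def _llm_type(self):
        return "Gemma pipeline"

    def _call(self, prompt, stop=None, run_manager=None, **kwargs):
        # Run the wrapped pipeline with the stored generation parameters
        outputs = self.hf_pipe(
            prompt,
            do_sample=self.pipe_kwargs['do_sample'],
            temperature=self.pipe_kwargs['temperature'],
            top_k=self.pipe_kwargs['top_k'],
            top_p=self.pipe_kwargs['top_p'],
            add_special_tokens=self.pipe_kwargs['add_special_tokens']
        )
        # generated_text starts with the prompt, so return only the new text
        return outputs[0]["generated_text"][len(prompt):]

    @property
    def _identifying_params(self):
        return {"n": self.pipe_kwargs}

# `pipe` is the Hugging Face pipeline configured earlier in the project
langchain_hf = GemmaLLM(hf_pipeline=pipe,
                        pipe_kwargs={
                            'do_sample': True,
                            'temperature': 0.1,
                            'top_k': 20,
                            'top_p': 0.3,
                            'add_special_tokens': True
                        })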

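The code works as follows:

Clearing the GPU cache: torch.cuda.empty_cache() releases cached GPU memory that is no longer in use, and gc.collect() runs the Python garbage collector; both calls free memory. The torch.no_grad() context disables gradient tracking, which is useful for inference (as opposed to training), although it has no effect on the cache clearing itself.
Defining the GemmaLLM class: creates a class that wraps the Hugging Face pipeline so that LangChain can use it.
The __init__ method: receives a Hugging Face pipeline instance and a dictionary of pipeline parameters, and stores both as attributes of the class.
The _llm_type property: returns a string describing the type of language model, here "Gemma pipeline".
The _call method: the core method invoked by the LangChain framework. It receives a prompt and generates text with the wrapped Hugging Face pipeline, returning the generated text without the prompt.
The _identifying_params property: returns a dictionary of the parameters that identify the model, here pipe_kwargs.
Instantiating GemmaLLM: creates an instance of GemmaLLM from the Hugging Face pipeline pipe and a set of parameters (do_sample, temperature, top_k, top_p, and add_special_tokens) that control text generation.

This makes the Hugging Face pipeline reusable within LangChain, so the same pipeline can easily be applied to different LangChain applications or workflows. The wrapper also makes the generation parameters easier to change, because they only need to be updated in one place.

Note: langchain_hf in the code above is an instance of GemmaLLM and can now be used for text generation in LangChain. It keeps all the parameters configured for the Hugging Face pipeline, which makes it well suited to experiments and prototyping.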
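
The reason `_call` slices off `len(prompt)` characters is that a text-generation pipeline returns the prompt followed by the completion by default. The sketch below reproduces that logic without loading a model. `fake_pipeline` and `call_llm` are hypothetical stand-ins, and unpacking `**pipe_kwargs` is shown as one possible variation on the explicit keyword list used in the class above, not as the article's code.

```python
from typing import Any, Callable


def fake_pipeline(prompt: str, **generation_kwargs: Any) -> list[dict[str, str]]:
    """Hypothetical stand-in for a pipeline: echoes the prompt plus a completion."""
    return [{"generated_text": prompt + " - a short summary."}]


def call_llm(
    pipe: Callable[..., list[dict[str, str]]],
    prompt: str,
    pipe_kwargs: dict[str, Any],
) -> str:
    """Run the pipeline and return only the newly generated text."""
    # Unpacking forwards every configured key, so adding a key needs no code change.
    outputs = pipe(prompt, **pipe_kwargs)
    # The returned string starts with the prompt itself, so slice it off.
    return outputs[0]["generated_text"][len(prompt):]


pipe_kwargs = {"do_sample": True, "temperature": 0.1, "top_k": 20, "top_p": 0.3}
completion = call_llm(fake_pipeline, "Summarize: some text", pipe_kwargs)
print(repr(completion))
```

The slicing relies on the output beginning with the exact prompt string; if the pipeline were configured to return only the new text, the slice would have to be removed.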

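(2) Print the first 350 characters of the variable prompt. If prompt is shorter than 350 characters, the slice simply returns the whole string without raising an error. The slice only truncates what is displayed; the prompt variable itself is unchanged.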

print(prompt[:350])

Running this prints:

<bos><start_of_turn>user
Summarize the following text in a technical way. Focus on facts, numbers and strategies used. Divide the summary in chapters, be impersonal and use bullet points:

<h2>TLDR</h2>
<p>We used an approach similar to audio spectrogram classification using the EfficientNet-B0 model, with numerous augmentations and transformer mod

(3) Call the Hugging Face pipeline wrapped in GemmaLLM to generate text, then render the output as Markdown so that it is easy to read in environments that support Markdown, such as Kaggle or Jupyter Notebook.

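from IPython.display import Markdown, display  # rendering helpers available in notebooks
out = langchain_hf.invoke(prompt)
display(Markdown(out.replace('#', '')))  # drop '#' so headings render as plain text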

The output is now:

Chapter 1: Introduction

Overview of the project: using an EfficientNet-B0 model for lip and pose classification.
Data preparation:
18 lip points, 20 pose points, and all hand points were extracted.
Various augmentations and transformer pre-processing were applied.
The input size was 160x80x3.
Chapter 2: Data Preprocessing

CNN pre-processing:
Global affine, shift-scale-rotate, and flip pose were applied.
Mixup augmentation was used for CNNs.
Transformer pre-processing:
Only 61 points were kept, including 40 lip points and 21 hand points.
Randomly selected distances and angles were included.
Chapter 3: Training

CNN training:
One-fold cross-validation with a random split and 0.1 warm-up.
Weighted cross-entropy loss with class weights.
EfficientNet-B0 with 5 blocks and 256 hidden units.
Transformer training:
One-fold cross-validation with a random split and 0.1 warm-up.
Ranger optimizer with 60% flat and 40% cosine annealing learning rate schedule.
4-layer, 256 hidden-size, 512 intermediate-size transformer.
Chapter 4: Hyperparameter Tuning

Optuna was used to tune most parameters.
The parameters list for CNN and transformer training are provided.
Chapter 5: Submissions and Ensemble

EfficientNet-B0 achieved a leaderboard score of approximately 0.8.
Ensemble of EfficientNet-B0, BERT, and DeBERTa was created.
A key feature was using the ensemble without softmax, which provided a boost of around 0.01.
Chapter 6: Conclusion

The project achieved a high accuracy on the lip and pose classification task.
The EfficientNet-B0 model with ensemble achieved the best performance.
The conversion of DepthwiseConv2D operation was a challenge, but a faster version was developed.
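
One side effect of the `replace('#', '')` call used before display is that it removes every `#` in the text, including ones that are part of the content. The sketch below shows a narrower alternative that strips only heading markers at the start of a line. It is an optional variation rather than the article's code; `strip_heading_markers` and the sample string are hypothetical.

```python
import re


def strip_heading_markers(markdown_text: str) -> str:
    """Remove leading '#' heading markers while leaving other '#' characters intact."""
    # Match 1-6 '#' characters at the start of a line, followed by whitespace.
    return re.sub(r"^[ \t]{0,3}#{1,6}[ \t]+", "", markdown_text, flags=re.MULTILINE)


# Hypothetical model output containing a heading and inline '#' characters.
sample = "## Chapter 1: Introduction\n* Written in C# and tagged #1\n"
print(sample.replace("#", ""))        # removes every '#'
print(strip_heading_markers(sample))  # removes only the heading marker
```

For the summaries shown here the difference is cosmetic, but it matters as soon as the source text contains `#` inside names or identifiers.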

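9.6.3 Summarizing with the MapReduce Strategy

MapReduce is a strategy for processing and condensing large datasets, including text, and is particularly useful when large amounts of data have to be processed in parallel. It is inspired by the MapReduce programming model developed at Google for distributed processing of large-scale datasets. In natural language processing (NLP) and text summarization, the strategy consists of the following two phases.

Map phase: each document chunk is processed independently to produce a summary. In this example, langchain_hf is run on each chunk with prompt_init as the prompt to generate a technical summary.
Reduce phase: all the summaries produced in the Map phase are merged into one coherent final summary, with combine_prompt as the prompt that tells the model how to merge the partial summaries.

In this way, the MapReduce strategy can handle large documents effectively and produce a structured, coherent summary. It is especially suitable when the document divides naturally into parts, for example when it is split on HTML headings.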
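
The two phases can be sketched in a few lines of plain Python. This is a simplified illustration of the control flow, not LangChain's implementation: `map_reduce_summarize`, `toy_llm`, and both template strings are hypothetical, and the toy model just keeps the first sentence of each line so that the example runs without a real LLM.

```python
from typing import Callable


def map_reduce_summarize(
    chunks: list[str],
    llm: Callable[[str], str],
    map_template: str,
    combine_template: str,
) -> str:
    """Summarize each chunk independently (map), then merge the results (reduce)."""
    # Map phase: one independent model call per chunk.
    partial_summaries = [llm(map_template.format(text=chunk)) for chunk in chunks]
    # Reduce phase: a single call over the concatenated partial summaries.
    combined_input = "\n\n".join(partial_summaries)
    return llm(combine_template.format(text=combined_input))


def toy_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model: keeps the first sentence of each line."""
    body = prompt.split(":\n", 1)[1]
    first_sentences = [
        line.split(".")[0].strip() + "."
        for line in body.splitlines()
        if line.strip()
    ]
    return " ".join(first_sentences)


MAP_TEMPLATE = "Summarize the following text:\n{text}"
COMBINE_TEMPLATE = "Combine these partial summaries:\n{text}"
chunks = ["Alpha one. Alpha two.", "Beta one. Beta two."]
summary = map_reduce_summarize(chunks, toy_llm, MAP_TEMPLATE, COMBINE_TEMPLATE)
print(summary)
```

Because the map calls do not depend on each other, they could be run in parallel; only the reduce call needs all of the partial summaries.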

(1) Use LangChain and the GemmaLLM instance langchain_hf defined earlier to implement text summarization with the MapReduce strategy.

# Import paths assume a recent LangChain release
from langchain.chains.summarize import load_summarize_chain
from langchain_core.prompts import PromptTemplate

# Prompt template for summarizing each text chunk
prompt_template = """<bos><start_of_turn>user
Summarize the following text in a technical way. Focus on facts, numbers and strategies used. Divide the summary in chapters, be impersonal and use bullet points:

{text}<end_of_turn>
<start_of_turn>model"""
prompt_init = PromptTemplate.from_template(prompt_template)

# Prompt template for the final output, i.e. the summary of the summaries
combine_template = """<bos><start_of_turn>user
You are given a text containing summaries of different part of a document.
Create one single summary combining all the information of the chapters. Divide the summary in chapters, be impersonal and use bullet points:

{text}<end_of_turn>
<start_of_turn>model"""
combine_prompt = PromptTemplate.from_template(combine_template)

# Build the summarization chain with the map_reduce strategy
# langchain_hf is the GemmaLLM instance that wraps the Hugging Face pipeline
# map_prompt is the prompt template for the Map phase
# combine_prompt is the prompt template for the Reduce phase
chain = load_summarize_chain(langchain_hf, chain_type='map_reduce', map_prompt=prompt_init, combine_prompt=combine_prompt)

# Run the chain on the split document chunks
# splits is the list of chunks obtained earlier by splitting on HTML tags
out_summary = chain.invoke(splits)

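The code works as follows:

Defining the summary prompt template: prompt_template is the prompt used to summarize each text chunk. It instructs the model to summarize the text in a technical way, to focus on facts, numbers, and the strategies used, and to divide the summary into chapters with bullet points.
Initializing the prompt template: PromptTemplate.from_template(prompt_template) creates the PromptTemplate instance prompt_init, which is applied to every text chunk in the Map phase.
Defining the final output prompt template: combine_template is the prompt used to generate the final summary. It instructs the model to merge the summaries of the different parts into a single summary.
Creating the combine prompt: PromptTemplate.from_template(combine_template) creates the PromptTemplate instance combine_prompt, which is used in the Reduce phase.
Creating the MapReduce chain: load_summarize_chain builds the MapReduce summarization chain chain, with langchain_hf as the model that generates the summaries, map_prompt as the prompt for the Map phase, and combine_prompt as the prompt for the Reduce phase.
Running the MapReduce chain: chain.invoke(splits) runs the chain on the previously split document chunks splits, generating a summary for each chunk and merging those summaries into one final summary.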
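
To see what the model actually receives in the Map phase, it helps to fill the `{text}` placeholder by hand. The sketch below uses plain `str.format`, which behaves like the default template format for a simple placeholder of this kind. The instruction line is shortened for illustration, and `MAP_TEMPLATE`, `build_prompt`, and the sample chunk are hypothetical names and data.

```python
# Shortened version of the chunk-level template, with the same Gemma turn markers.
MAP_TEMPLATE = (
    "<bos><start_of_turn>user\n"
    "Summarize the following text in a technical way:\n\n"
    "{text}<end_of_turn>\n"
    "<start_of_turn>model"
)


def build_prompt(template: str, chunk: str) -> str:
    """Substitute one document chunk into the {text} placeholder."""
    return template.format(text=chunk)


# Hypothetical chunk produced by an HTML-based split.
prompt = build_prompt(MAP_TEMPLATE, "<h2>TLDR</h2><p>We used EfficientNet-B0.</p>")
print(prompt)
```

The prompt ends with the opening marker of the model turn, so the text generated after it is the model's answer to the user turn.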

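(2) The following code displays the summary as Markdown in a Jupyter Notebook. Removing the '#' characters keeps the model's heading markers from being rendered as Markdown headings, which gives a tidy, uniformly formatted view that is easy to read.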

display(Markdown(out_summary['output_text'].replace('#', '')))

Running this prints:

Chapter 1: Introduction

The task is to classify images into different categories.
We use an approach similar to audio spectrogram classification.
We use multiple models, including EfficientNet-B0 and DeBERTa.
Chapter 2: Model Architecture

EfficientNet-B0 model with input size of 160x80.
Transformer models (BERT and DeBERTa) as helper models.
The final solution consists of one EfficientNet-B0 with an input size of 160x80.
Chapter 3: Training

We use 8 randomly split folds for training.
A single fold model is trained on each fold.
We use a single EfficientNet-B0 model with an input size of 160x80.
Chapter 4: Evaluation

We use a single fold for evaluation.
The model has a CV score of 0.898.
The model has a leaderboard score of ~0.8.
Chapter 5: Data Preprocessing

Extracted 18 lip points, 20 pose points (including arms, shoulders, eyebrows, and nose), and all hand points.
Applied various augmentations to the data.
Implemented standard normalization.
Filled in NaN values with zeros.
Interpolated the time axis to a size of 160 using 'nearest' interpolation.
Chapter 6: Feature Extraction

Only 61 points were kept, including 40 lip points and 21 hand points.
For left and right hand, the one with less NaN was kept.
If right hand was kept, mirror it to left hand.
Chapter 7: Feature Engineering

Hand-crafted features were also used, including motion, distances, and cosine of angles.
Motion features consist of future motion and history motion.
Full 210 pairwise distances among 21 hand points were included.
15 angles of 5 fingers were included.
Chapter 8: Data Augmentation

Sequences longer than 96 were interpolated to 96.
Sequences shorter than 96 were unchanged.

To be continued
