(21-2) An Intelligent Text Summarization System Based on the Gemma 2B Model: System Setup

Latest recommended article published on 2024-07-08 17:54:34

码农三叔

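Columns: 《NLP算法实战》 (NLP Algorithms in Practice), 大模型从入门到实战 (Large Models from Beginner to Practice). Tags: language models, artificial intelligence, natural language processing, python, langchain, Gemma 2B, NLP

Article link: https://blog.csdn.net/asd343442/article/details/139155333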

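9.3 System Setup

Before the project starts, the environment must be configured and the libraries, data, and Gemma model prepared. A text-generation pipeline is then created with Hugging Face's transformers library, and its parameters, such as the maximum number of new tokens and the numeric precision, are configured to improve model performance and output.

9.3.1 Preparing the Environment and Data

(1) Prepare the environment and tools the project needs by importing the required Python libraries with the following code.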

from transformers import pipeline, set_seed
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from accelerate.utils import release_memory
import torch
from kaggle_secrets import UserSecretsClient
from huggingface_hub import login
from datasets import Dataset
from trl import SFTTrainer
from peft import LoraConfig, PeftModel
import pandas as pd
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import CharacterTextSplitter, HTMLHeaderTextSplitter
from langchain.prompts import PromptTemplate
from langchain.docstore.document import Document
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
import evaluate
import transformers
from langchain.llms.base import LLM
from typing import Any
import warnings
import gc
import random
import numpy as np
from IPython.display import display, Markdown  # needed later to render the generated summary

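(2) Set the random seeds to make the code reproducible. Every random number generator used here is seeded with 42, so each run draws the same pseudo-random sequence and results can be repeated. Seeding alone does not make every GPU operation deterministic, so small differences between runs remain possible.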

set_seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed(42)
np.random.seed(42)
random.seed(42)

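The code above works as follows:

set_seed(42): the transformers helper that seeds Python's random module, NumPy, and PyTorch in one call. The explicit calls after it are therefore redundant but harmless.
torch.manual_seed(42): seeds PyTorch's random number generator so that random values drawn during training are repeatable.
torch.cuda.manual_seed(42): seeds PyTorch's generator for the current CUDA device so that random values drawn on the GPU are repeatable.
np.random.seed(42): seeds NumPy's global random number generator.
random.seed(42): seeds Python's built-in random number generator.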
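
The seeding calls listed above can be collected into one helper. The following is a minimal sketch, not code from the original project: the name seed_everything is hypothetical, and NumPy and PyTorch are treated as optional so the snippet also runs where they are not installed.

```python
import random


def seed_everything(seed: int = 42) -> None:
    """Seed every random number generator that is available (hypothetical helper)."""
    random.seed(seed)  # Python's built-in generator
    try:
        import numpy as np
        np.random.seed(seed)  # NumPy's global generator
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)  # PyTorch generator
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)  # every visible GPU
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


# Re-seeding with the same value reproduces the same sequence.
seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
print(first == second)
```

Calling the helper once at the top of a notebook has the same intent as the five separate calls in the article.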

(3) Prepare the inputs for the text-generation task with Hugging Face's Transformers library: read the dataset, log in to the Hugging Face Hub, and specify the path of the Gemma model.

writeups = pd.read_csv('/input/kaggle-winning-solutions-methods/kaggle_winning_solutions_methods.csv')
writeups = writeups.drop_duplicates(subset=['link', 'writeup']).reset_index(drop=True)


hf_access_token = UserSecretsClient().get_secret("hf_token")
login(token = hf_access_token)
model = "/input/gemma/transformers/2b-it/3"

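The code above works as follows:

Read the dataset: pandas reads the CSV file at /input/kaggle-winning-solutions-methods/kaggle_winning_solutions_methods.csv. The file holds information about winning solutions and methods from Kaggle competitions. drop_duplicates then removes rows that repeat the same link and writeup, and reset_index(drop=True) renumbers the index.
Log in to Hugging Face: UserSecretsClient().get_secret("hf_token") retrieves the Hugging Face access token from Kaggle's secret store instead of hard-coding it, and login uses the token to authenticate. This gives access to models and other resources on the Hugging Face Hub.
Specify the model path: the model variable is set to /input/gemma/transformers/2b-it/3, the location of the Gemma 2B model among the Kaggle inputs. It is a large language model suited to text-generation tasks.

These steps prepare for the text-generation task: the dataset is loaded, access to the model is authorized, and the model to use is specified. The next steps create the text-generation pipeline and use it to generate and process text.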
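
To make the deduplication step concrete, here is a small sketch on a toy table. The rows and the place column are invented for illustration; only the link and writeup column names come from the code above. The final filter for string-valued writeups is an extra precaution and is not part of the original code.

```python
import pandas as pd

# Toy stand-in for the Kaggle writeups table (values are made up).
toy = pd.DataFrame({
    "link": ["a", "a", "b", "c"],
    "writeup": ["x", "x", "y", None],
    "place": [1, 2, 1, 3],
})

# Same call as in the article: rows count as duplicates when both
# 'link' and 'writeup' match; the first occurrence is kept.
deduped = toy.drop_duplicates(subset=["link", "writeup"]).reset_index(drop=True)

# Optional extra step (an assumption): keep only rows whose writeup is a string.
is_text = deduped["writeup"].apply(lambda value: isinstance(value, str))
usable = deduped[is_text].reset_index(drop=True)

print(len(toy), len(deduped), len(usable))
```

The second "a"/"x" row is dropped even though its place value differs, because only the columns named in subset are compared.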

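9.3.2 Pipeline

The pipeline is the basis for tasks such as summarizing text, continuing text, and generating replies. It bundles a tokenizer, the model itself, and the parameters that control the output.

(1) Create a text-generation pipeline with Hugging Face's transformers library. It uses the Gemma 2B model and sets a few key parameters to improve performance and output.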

pipe = pipeline(
    "text-generation",
    model=model,
    model_kwargs={"torch_dtype": torch.float16},
    device='cuda',
    max_new_tokens=512
)

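In this project, pipelines offer an efficient and convenient way to run inference with the model. A pipeline consists of:

A tokenizer. If none is specified explicitly, it is loaded automatically from the model's configuration.
The model itself.
Parameters that control and fine-tune the output.

The code above already sets parameters tied to these components:

max_new_tokens: the maximum number of new tokens to generate. If it is not set, the default may be too small to produce enough text for a summary.
model_kwargs: sets the precision to torch.float16, which reduces memory use.

Creating the pipeline prints the following log: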

2024-04-14 23:42:21.932731: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-14 23:42:21.932785: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-14 23:42:21.934231: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
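
The memory benefit of torch.float16 mentioned above can be estimated with simple arithmetic: weight memory is roughly the number of parameters times the bytes per parameter. The sketch below assumes about 2.5 billion parameters for Gemma 2B, an approximate figure that the article does not state, and it ignores activations and the generation cache.

```python
# Bytes used to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gib(n_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


N_PARAMS = 2.5e9  # assumed, approximate parameter count

for dtype in ("float32", "float16"):
    print(f"{dtype}: {weight_memory_gib(N_PARAMS, dtype):.2f} GiB")
```

Under that assumption the weights need roughly 9.3 GiB in float32 and roughly 4.7 GiB in float16, which is why half precision matters on a single Kaggle GPU.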

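(2) Take the first writeup from the dataset, print its total number of characters, and show its first 1000 characters. This text is the input for the generation task, for example summarization or text continuation. Because a writeup can be long, printing only the first 1000 characters gives a quick look at how it begins without flooding the output.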

writeup = writeups.iloc[0, 9]
print('Number of characters:', len(writeup))
print(writeup[:1000])

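The code above works as follows:

writeup = writeups.iloc[0, 9]: selects the value in the first row and tenth column of writeups and assigns it to writeup. iloc selects by integer position, and positions start at 0, so 9 refers to the tenth column. This assumes the tenth column holds the text to be processed.
print('Number of characters:', len(writeup)): prints the number of characters in writeup. len() returns the length of a string, that is, its character count.
print(writeup[:1000]): prints the first 1000 characters of writeup. The slice [:1000] returns the characters at indices 0 through 999.

Running this code outputs: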

Number of characters: 9864
<h2>TLDR</h2>
<p>We used an approach similar to audio spectrogram classification using the EfficientNet-B0 model, with numerous augmentations and transformer models such as BERT and DeBERTa as helper models. The final solution consists of one EfficientNet-B0 with an input size of 160x80, trained on a single fold from 8 randomly split folds, as well as DeBERTa and BERT trained on the full dataset. A single fold model using EfficientNet has a CV score of 0.898 and a leaderboard score of ~0.8.</p>
<p>We used only competition data.</p>
<h2>1. Data Preprocessing</h2>
<h3>1.1 CNN Preprocessing</h3>
<ul>
<li>We extracted 18 lip points, 20 pose points (including arms, shoulders, eyebrows, and nose), and all hand points, resulting in a total of 80 points.</li>
<li>During training, we applied various augmentations.</li>
<li>We implemented standard normalization.</li>
<li>Instead of dropping NaN values, we filled them with zeros after normalization.</li>
<li>We interpolated the time axis to a siz

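Note: the code above assumes the writeup is a string. If the text is shorter than 1000 characters, the slice simply returns the whole text. If the value is not a string, for example a missing value that pandas reads as NaN, len() raises a TypeError and the code does not work as expected.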
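
The behaviour described in the note can be checked with a small helper. This is a sketch only: the function name preview and its explicit type check are assumptions added for illustration and do not appear in the original notebook.

```python
def preview(text, limit=1000):
    """Return the full length and the first `limit` characters of a string."""
    if not isinstance(text, str):
        # A missing value read by pandas is a float NaN, not a string.
        raise TypeError(f"expected str, got {type(text).__name__}")
    return len(text), text[:limit]


# A short text: the slice returns everything and raises no error.
print(preview("short text"))

# A long text: only the first 1000 characters are returned.
length, head = preview("x" * 1500)
print(length, len(head))
```

Checking the type up front turns a confusing failure deep in the pipeline into an immediate, readable error.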

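(3) Use the transformers pipeline to generate a technical summary that focuses on facts, numbers, and the strategies used, organized into chapters, written impersonally, and formatted with bullet points. The generated summary is rendered as Markdown so it is easy to read.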

messages = [
    {
        "role": "user",
        "content": "Summarize the following text in a technical way. Focus on facts, numbers and strategies used. Divide the summary in chapters, be impersonal and use bullet points:\n\n{}".format(writeup)
    }
]

prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(
    prompt,
    do_sample=True,
    temperature=0.1,
    top_k=20,
    top_p=0.3,
    add_special_tokens=True
)

display(Markdown(outputs[0]["generated_text"][len(prompt):].replace('#', '')))

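The code above works as follows:

messages: a list holding one dictionary with the user role and the content. The content is a string instructing the model to summarize the given text technically, focusing on facts, numbers, and strategies used. It also asks the model to divide the summary into chapters, be impersonal, and use bullet points.
pipe.tokenizer.apply_chat_template(): formats the messages with the model's chat template. tokenize=False returns the formatted prompt as a string instead of token IDs, and add_generation_prompt=True appends the marker that opens the model's turn, so generation starts as the model's reply.
pipe(): generates the summary from the prompt. do_sample=True enables random sampling, temperature=0.1 sets the sampling temperature, and top_k=20 and top_p=0.3 limit the diversity of the output by restricting sampling to the most likely tokens.
display(Markdown()): renders Markdown text in the output. outputs[0]["generated_text"] holds the prompt followed by the generated text. Slicing with [len(prompt):] drops the prompt, and replace('#', '') removes heading markers so that the summary's headings do not clash with the table of contents.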
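
To see how temperature, top_k, and top_p narrow the choice of the next token, the following sketch applies them to a toy list of logits. It is a simplified illustration written for this article, not the implementation inside transformers, and the function name and toy values are assumptions.

```python
import math


def filter_distribution(logits, temperature=0.1, top_k=20, top_p=0.3):
    """Return {token_index: probability} for tokens that survive filtering."""
    # 1. Temperature: dividing by a value below 1 sharpens the distribution.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]  # stable softmax
    total = sum(exps)
    probs = [value / total for value in exps]

    # 2. Top-k: keep the k most likely tokens and renormalize.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_total = sum(probs[i] for i in order)

    # 3. Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / k_total
        if cumulative >= top_p:
            break

    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


toy_logits = [2.0, 1.0, 0.5, 0.1]
narrow = filter_distribution(toy_logits, temperature=0.1, top_k=20, top_p=0.3)
wide = filter_distribution(toy_logits, temperature=1.0, top_k=20, top_p=0.9)
print(len(narrow), len(wide))
```

With the article's settings only the single most likely toy token survives, so sampling behaves almost like greedy decoding. Looser settings leave several candidates to sample from.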

Running the code from step (3) renders the following Markdown output:

Chapter 1: Introduction

Overview of the project: using an EfficientNet-B0 model for lip and pose classification.

Data preparation:

18 lip points, 20 pose points, and all hand points were extracted.

Various augmentations and transformer pre-processing were applied.

The input size was 160x80x3.

Chapter 2: Data Preprocessing

CNN pre-processing:

Global affine, shift-scale-rotate, and flip pose were applied.

Mixup augmentation was used for CNNs.

Transformer pre-processing:

Only 61 points were kept, including 40 lip points and 21 hand points.

Randomly selected distances and angles were included.

Chapter 3: Training

CNN training:

One-fold cross-validation with a random split and 0.1 warm-up.

Weighted cross-entropy loss with class weights.

EfficientNet-B0 with 5 blocks and 256 hidden units.

Transformer training:

One-fold cross-validation with a random split and 0.1 warm-up.

Ranger optimizer with 60% flat and 40% cosine annealing learning rate schedule.

4-layer, 256 hidden-size, 512 intermediate-size transformer.

Chapter 4: Hyperparameter Tuning

Optuna was used to tune most parameters.

The parameters list for CNN and transformer training are provided.

Chapter 5: Submissions and Ensemble

EfficientNet-B0 achieved a leaderboard score of approximately 0.8.

Ensemble of EfficientNet-B0, BERT, and DeBERTa was created.

A key feature was using the ensemble without softmax, which provided a boost of around 0.01.

Chapter 6: Conclusion

The project achieved a high accuracy on the lip and pose classification task.

The EfficientNet-B0 model with ensemble achieved the best performance.

The conversion of DepthwiseConv2D operation was a challenge, but a faster version was developed.
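
The post-processing applied before display (dropping the prompt, then removing '#') can be tried on a toy string. The sketch below also shows a hypothetical alternative, drop_heading_markers, that strips only leading heading markers. That alternative is not the author's method; it is included because replace('#', '') also removes '#' characters inside ordinary text.

```python
import re


def strip_prompt(generated_text: str, prompt: str) -> str:
    """Drop the echoed prompt from the start of the generated text."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text


def drop_heading_markers(markdown_text: str) -> str:
    """Remove only the '#' markers that open a Markdown heading line."""
    return re.sub(r"^[ \t]{0,3}#{1,6}[ \t]*", "", markdown_text, flags=re.MULTILINE)


prompt = "PROMPT TEXT\n"  # placeholder, not a real chat-template prompt
generated = prompt + "## Chapter 1\n* Uses C# bindings\n"

body = strip_prompt(generated, prompt)
article_style = body.replace("#", "")      # approach used in the article
heading_only = drop_heading_markers(body)  # hypothetical alternative

print(repr(article_style))
print(repr(heading_only))
```

Either approach keeps the rendered summary from adding headings to the notebook outline. The second one leaves text such as "C#" intact.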
