(21-7-01) Intelligent Text Summarization System Based on the Gemma 2B Model: Fine-Tuning (1)

9.7  Fine-Tuning

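In real-world applications, a pretrained model such as Gemma may perform poorly in a specific domain, or a project may face particular requirements such as low-resource languages, sensitive data, or cost and privacy constraints. In these cases the model needs to be fine-tuned. This project uses a resource-saving approach, Parameter-Efficient Fine-Tuning (PEFT).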

9.7.1  Parameter-Efficient Fine-Tuning

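Parameter-Efficient Fine-Tuning (PEFT) is a family of techniques for adapting a pretrained model to a specific task without retraining all of the model's parameters. It is especially suited to scenarios where compute consumption and training time need to be reduced. The main characteristics of PEFT are as follows.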

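  1. Parameter sharing: PEFT typically adds or adjusts a small set of parameters on top of the pretrained model while keeping most of the original parameters unchanged.
  2. Adaptability: through fine-tuning, the model can better fit a specific dataset or task, even when that task differs from the pretraining task.
  3. Computational efficiency: because only a small fraction of the parameters is adjusted, PEFT uses less compute than conventional full-parameter fine-tuning.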

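In short, parameter-efficient fine-tuning customizes a pretrained model quickly and efficiently while preserving the knowledge it already holds. This makes it possible to take full advantage of large pretrained models even in resource-constrained environments.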

9.7.2  LoRA Fine-Tuning

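LoRA (Low-Rank Adaptation) is one technique for implementing PEFT. It adjusts the model by adding a low-rank update to selected weight matrices: the original weights stay frozen and only the small low-rank factors are trained. LoRA is particularly suitable for large models because it significantly reduces the number of trainable parameters and the amount of computation required.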
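To make the low-rank idea concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the peft implementation: the helper names (matmul, lora_effective_weight, lora_param_count) and the toy matrix values are made up for this example. It shows the adapted weight W + (alpha / r) * B @ A, the fact that a zero-initialized B leaves the model unchanged at the start of training, and how few parameters the two factors need compared with the full matrix.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A without modifying the frozen W."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_param_count(d_out, d_in, r):
    """Trainable parameters of one LoRA pair: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


# Hypothetical toy sizes: a 3 x 4 frozen weight adapted with rank r = 2.
d_out, d_in, r, alpha = 3, 4, 2, 4
w = [[0.1 * (i + j) for j in range(d_in)] for i in range(d_out)]
lora_a = [[0.5, -0.5, 1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]  # r x d_in
lora_b_init = [[0.0] * r for _ in range(d_out)]           # d_out x r, zeros at start

# With B = 0 the adapted weight is identical to the frozen weight.
w_start = lora_effective_weight(w, lora_a, lora_b_init, alpha, r)

# Once B has been trained (values invented here), W shifts by a low-rank matrix.
lora_b_trained = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_tuned = lora_effective_weight(w, lora_a, lora_b_trained, alpha, r)

# For a realistically sized square matrix the saving is large.
full = 2048 * 2048
lora = lora_param_count(2048, 2048, r=6)
print(full, lora, f"{lora / full:.4%}")
```

The toy matrices are deliberately tiny so the arithmetic can be checked by hand; the saving only becomes visible at realistic sizes, which is what the last three lines compute.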

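(1) The following code cleans up memory to free system resources, which is useful when resources are limited or when memory that is no longer in use needs to be released.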

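import gc
import torch
from accelerate.utils import release_memory  # assumption: the original does not show where this helper is imported from

# Drop the reference to the LangChain pipeline object created earlier
langchain_hf = release_memory(langchain_hf)

with torch.no_grad():
    torch.cuda.empty_cache()  # hand cached, unused GPU memory back to the driver
gc.collect()  # as the last expression in a notebook cell, its return value is displayed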

Running this produces the following output:

0

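The 0 is the return value of gc.collect(), the number of unreachable objects the garbage collector found. Managing memory this way matters when running resource-intensive deep learning workloads such as fine-tuning a large language model (LLM): releasing the earlier pipeline and the cached GPU memory helps keep the application or training job from crashing with an out-of-memory error.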
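The number shown in the output comes from Python's garbage collector, not from PyTorch. The following standard-library sketch (the Node class and collect_after_cycle function are hypothetical names for this illustration) creates one unreachable reference cycle and shows that gc.collect() reports a positive count when there is something to free.

```python
import gc


class Node:
    """Tiny object that can point at another Node."""

    def __init__(self):
        self.other = None


def collect_after_cycle():
    """Create one unreachable reference cycle and return what gc.collect() reports."""
    gc.collect()                 # start from a clean slate
    was_enabled = gc.isenabled()
    gc.disable()                 # keep the automatic collector from running first
    try:
        a, b = Node(), Node()
        a.other, b.other = b, a  # a <-> b cycle: reference counts never reach zero
        del a, b                 # unreachable now, but only the cycle collector can free them
        return gc.collect()      # number of unreachable objects found
    finally:
        if was_enabled:
            gc.enable()


found = collect_after_cycle()
print(found)         # a positive count: the two Node objects, possibly plus their attribute dicts
print(gc.collect())  # usually 0 right afterwards, because nothing is left to collect
```

A result of 0 therefore means only that no cyclic garbage was found at that moment; it says nothing about how much GPU memory torch.cuda.empty_cache() released.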

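(2) Load a validation dataset from the specified CSV file path, keeping two columns: "article" and "highlights". In machine learning, a validation set is used to evaluate a model on text it has not seen. The code also displays the first few rows so the format and content of the data can be checked quickly.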

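import pandas as pd

# Keep only the source text and its reference summary
validation = pd.read_csv('input/newspaper-text-summarization-cnn-dailymail/cnn_dailymail/validation.csv')[['article', 'highlights']]

validation.head()  # show the first five rows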

Running this produces the following output:

	article	highlights
0	Sally Forrest, an actress-dancer who graced th...	Sally Forrest, an actress-dancer who graced th...
1	A middle-school teacher in China has inked hun...	Works include pictures of Presidential Palace ...
2	A man convicted of killing the father and sist...	Iftekhar Murtaza, 29, was convicted a year ago...
3	Avid rugby fan Prince Harry could barely watch...	Prince Harry in attendance for England's crunc...
4	A Triple M Radio producer has been inundated w...	Nick Slater's colleagues uploaded a picture to...

(3) Configure a pretrained Gemma model for causal language modeling (CLM), set up for parameter-efficient fine-tuning with Low-Rank Adaptation (LoRA) and 4-bit quantization from the bitsandbytes (BnB) library.

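import torch
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "input/gemma/transformers/2b-it/3"  # local path to the Gemma checkpoint

# LoRA: rank-6 adapters on the attention and MLP projection layers
lora_config = LoraConfig(
    r=6,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Load the base weights in 4-bit NF4 and run computation in float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "right"  # Fixing overflow issue ref: source code
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=bnb_config)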

Running this produces the following output:

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
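To get a feel for what r=6 means with these target modules, the sketch below estimates the number of trainable LoRA parameters. Every shape in it is an assumption about the Gemma 2B architecture (hidden size 2048, MLP size 16384, 18 layers, 8 query heads and 1 key/value head of dimension 256) and is not read from the checkpoint, so the values should be checked against model.config before relying on the result. The function names are hypothetical.

```python
# Assumed Gemma 2B shapes; NOT read from the checkpoint, verify against model.config.
HIDDEN = 2048
INTERMEDIATE = 16384
NUM_LAYERS = 18
Q_OUT = 8 * 256   # assumed 8 query heads x head dimension 256
KV_OUT = 1 * 256  # assumed a single key/value head

# (in_features, out_features) for each name listed in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, r):
    """One LoRA pair: A is r x in_features, B is out_features x r."""
    return r * (in_features + out_features)


def total_lora_params(r, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """Sum the adapter parameters over all target modules in every layer."""
    per_layer = sum(lora_params(i, o, r) for i, o in shapes.values())
    return per_layer * num_layers


print(f"{total_lora_params(r=6):,} trainable LoRA parameters (under the assumed shapes)")
```

Under these assumptions the adapters come to roughly 7.4 million parameters, a small fraction of a model with about 2 billion weights. Once the adapters are actually attached with peft, calling print_trainable_parameters() on the resulting PEFT model gives the real figure to compare against.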

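(4) Create a formatted prompt for each sample in the training dataset. These prompts are used to train or fine-tune the language model so that it can generate a summary of an article.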

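from datasets import Dataset

# Wrap the DataFrame loaded in step (2) as a Hugging Face Dataset
train_data = Dataset.from_pandas(validation)

def formatting_prompts_func(example):
    """Turn a batch of rows into chat-formatted training strings."""
    output_texts = []
    for i in range(len(example['article'])):
        messages = [
            {"role": "user",
             "content": "Given the following article, write a short summary of the article in 2-3 sentences:\n\nArticle: {}".format(example['article'][i])},
            {"role": "assistant",
             "content": "{}".format(example['highlights'][i])}
        ]
        # tokenize=False returns the rendered string instead of token ids
        output_texts.append(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False))

    return output_texts

# Slicing with [:1] yields a batch (a dict of lists) that holds the first row
print(formatting_prompts_func(train_data[:1])[0])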

Running this produces the following output:

<bos><start_of_turn>user
Given the following article, write a short summary of the article in 2-3 sentences:

Article: Sally Forrest, an actress-dancer who graced the silver screen throughout the '40s and '50s in MGM musicals and films such as the 1956 noir While the City Sleeps died on March 15 at her home in Beverly Hills, California. Forrest, whose birth name was Katherine Feeney, was 86 and had long battled cancer. Her publicist, Judith Goffin, announced the news Thursday. Scroll down for video . Actress: Sally Forrest was in the 1951 Ida Lupino-directed film 'Hard, Fast and Beautiful' (left) and the 1956 Fritz Lang movie 'While the City Sleeps' A San Diego native, Forrest became a protege of Hollywood trailblazer Ida Lupino, who cast her in starring roles in films including the critical and commercial success Not Wanted, Never Fear and Hard, Fast and Beautiful. Some of Forrest's other film credits included Bannerline, Son of Sinbad, and Excuse My Dust, according to her iMDB page. The page also indicates Forrest was in multiple Climax! and Rawhide television episodes. Forrest appeared as herself in an episode of The Ed Sullivan Show and three episodes of The Dinah Shore Chevy Show, her iMDB page says. She also starred in a Broadway production of The Seven Year Itch. City News Service reported that other stage credits included As You Like It, No, No, Nanette and Damn Yankees. Forrest married writer-producer Milo Frank in 1951. He died in 2004. She is survived by her niece, Sharon Durham, and nephews, Michael and Mark Feeney. Career: A San Diego native, Forrest became a protege of Hollywood trailblazer Ida Lupino, who cast her in starring roles in films .<end_of_turn>
<start_of_turn>model
Sally Forrest, an actress-dancer who graced the silver screen throughout the '40s and '50s in MGM musicals and films died on March 15 .
Forrest, whose birth name was Katherine Feeney, had long battled cancer .
A San Diego native, Forrest became a protege of Hollywood trailblazer Ida Lupino, who cast her in starring roles in films .<end_of_turn>

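This formatting step is important because it ensures that the input the model receives is structured and steers the model to produce output in the form it expects. Here the formatted prompt mimics a chat exchange: the user submits an article and the system replies with the corresponding summary.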
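The structure of that output can be reproduced by hand, which makes it easier to see what the chat template contributes. The function below is a hand-written approximation inferred only from the printed example above; it is not the tokenizer's actual template, and render_gemma_style and ROLE_NAMES are hypothetical names. The article and summary are replaced with placeholders.

```python
# The printed output labels the assistant turn "model", so map the role name accordingly.
ROLE_NAMES = {"user": "user", "assistant": "model"}


def render_gemma_style(messages, add_generation_prompt=False):
    """Approximate the chat layout seen in the output above (not the real template)."""
    text = "<bos>"
    for message in messages:
        role = ROLE_NAMES[message["role"]]
        text += f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n"
    if add_generation_prompt:
        # At inference time the prompt ends with an open model turn for the model to complete.
        text += "<start_of_turn>model\n"
    return text


messages = [
    {"role": "user",
     "content": "Given the following article, write a short summary of the article in 2-3 sentences:\n\nArticle: <article text>"},
    {"role": "assistant",
     "content": "<reference summary>"},
]

print(render_gemma_style(messages))
```

For training, both turns are rendered so the model sees the reference summary as the target. For generation, only the user turn would be passed with add_generation_prompt=True, leaving the model turn open.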

To be continued
