Fine-tuning llama2 with the YiVal framework and visualizing the results

Article link: https://blog.csdn.net/YiVal/article/details/134557622

Fine-tuning large language models (LLMs)

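Fine-tuning customizes a pre-trained model by training it on new data, so that it absorbs new knowledge or specializes in a particular task. For large language models (LLMs), fine-tuning is typically used to turn a general-purpose base model (for example GPT-3) into a specialized model for a specific use case, or to improve a base model's fit for a specific task.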

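The main advantage of this approach is that the model can reach better performance with limited data. Fine-tuning an LLM means adjusting the weights and parameters of the pre-trained model with a smaller dataset specific to the target task.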

The role of YiVal in fine-tuning

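YiVal is a versatile yet unified platform that simplifies GenAI application development by providing flexible evaluation and fine-tuning for a range of building blocks, including model metadata, parameters, prompts, and retrieval configurations.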

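In this article we run a full evaluation of the English-to-Chinese translation ability of llama2 and gpt-3.5-turbo. To establish a baseline, we use the bertscore evaluator provided by YiVal. Because llama2, as a base model, has inherent limitations in Chinese translation, we also fine-tune llama2 on a dataset generated by YiVal so that the improvement can be compared.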

Steps

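Generate the training data with YiVal's GPT-4-based data generator
Compare the English-to-Chinese translation ability of llama2 and GPT-3.5
Fine-tune llama2 with the help of GPT-4
Re-evaluate and visualize the fine-tuning results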

1. Generate the test data

custom_function: demo.translate_quiz.translate_quiz
description: Generated experiment config
dataset:
  data_generators:
    openai_prompt_data_generator:
      chunk_size: 100000
      diversify: true
      # model_name specifies the LLM, e.g. a16z-infra/llama-2-13b-chat:9dff94b1bed5af738655d4a7cbcdcde2bd503aa85c94334fe1f42af7f3dd5ee3
      model_name: gpt-4
      prompt:
          "Please provide a concrete and realistic test case as a dictionary for function invocation using the ** operator.
          Only include parameters, excluding description and name.
          Ensure it's succinct and well-structured.
          Only provide the dictionary."
      input_function:
        description:
          The current function is to evaluate the English to Chinese translation ability of the large language model. You will play the role of a teacher, so please provide a coherent English sentence (teacher_quiz), and give the corresponding Chinese translation (teacher_answer).
        name: translation_english_to_chinese
        parameters:
          teacher_quiz: str
          teacher_answer: str
      expected_param_name: teacher_answer
      number_of_examples: 2
      output_path: english2chinese1.pkl
      call_option:
        temperature: 1.6
        presence_penalty: 2
  source_type: machine_generated
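
The generator prompt above asks GPT-4 for a dictionary that can be passed to the function with the ** operator. The following is a minimal sketch of what that means. The function body is a hypothetical stand-in that only reuses the name and parameters from input_function, and the sentence pair is made up for illustration; neither is YiVal code or YiVal output.

```python
def translation_english_to_chinese(teacher_quiz: str, teacher_answer: str) -> dict:
    """Hypothetical stand-in with the same parameters as input_function."""
    return {"quiz": teacher_quiz, "expected": teacher_answer}


# Illustrative test case in the shape the prompt asks for:
# parameters only, no name or description.
generated_case = {
    "teacher_quiz": "The weather is nice today.",
    "teacher_answer": "今天天气很好。",
}

# The ** operator unpacks the dictionary into keyword arguments.
result = translation_english_to_chinese(**generated_case)
print(result["expected"])
```

In the config, expected_param_name: teacher_answer marks which of these keys is treated as the expected value rather than as an input.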

2. Compare the models before fine-tuning

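Before fine-tuning, let's see how the original llama2 handles this translation task. With YiVal's evaluators it is easy to compare llama2 with GPT-3.5 on BertScore, token usage, latency, or any custom metric.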

variations:
  - name : model_name
    variations:
      - instantiated_value: gpt-3.5-turbo
        value: gpt-3.5-turbo
        value_type: str
        variation_id: null

      - instantiated_value: replicate/a16z-infra/llama-2-13b-chat:9dff94b1bed5af738655d4a7cbcdcde2bd503aa85c94334fe1f42af7f3dd5ee3
        value: a16z-infra/llama-2-13b-chat:9dff94b1bed5af738655d4a7cbcdcde2bd503aa85c94334fe1f42af7f3dd5ee3
        value_type: str
        variation_id: null
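
To make the three bertscore numbers used later easier to read, here is a rough sketch of the greedy-matching idea behind BERTScore precision, recall, and F. It is not YiVal's evaluator: the real metric uses contextual embeddings from a pretrained model, and optionally idf weighting and baseline rescaling, whereas this sketch uses made-up 2-d vectors and a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_bertscore(candidate_vecs, reference_vecs):
    """Return (precision, recall, f) from greedy cosine matching."""
    # Precision: each candidate token is matched to its closest reference token.
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    # Recall: each reference token is matched to its closest candidate token.
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


# Toy "embeddings": the candidate covers two of the three reference tokens.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
p, r, f = greedy_bertscore(candidate, reference)
print(p, r, f)
```

Here every candidate token has an exact match, so precision is 1, while the unmatched third reference token pulls recall and F below 1.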

3. Fine-tune llama2 and re-evaluate

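We provide a very simple way to fine-tune llama2. The fine-tuning data can come from a dataset that satisfies the evaluator's conditions, or from GPT-4-generated data that includes expected values. For more information, see our replicate_finetune repository. When fine-tuning completes, you receive a model_name, and you can use this identifier to call the fine-tuned model through Replicate.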

!poetry run python /content/YiVal/src/yival/dataset/replicate_finetune_utils.py
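
As a rough illustration of how generated quiz/answer pairs could be turned into a training file, the sketch below writes one JSON record per line. The "prompt" and "completion" field names, the instruction text, and the helper name to_jsonl are assumptions for illustration; they are not taken from replicate_finetune_utils.py, so check what that script and Replicate actually expect before relying on this format.

```python
import json


def to_jsonl(pairs, instruction="Translate the following English sentence into Chinese:"):
    """Hypothetical helper: serialize (quiz, answer) pairs as JSON Lines text."""
    lines = []
    for quiz, answer in pairs:
        # Field names are assumed, not confirmed by the article.
        record = {"prompt": f"{instruction}\n{quiz}", "completion": answer}
        # ensure_ascii=False keeps the Chinese text readable in the file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Made-up sentence pairs standing in for teacher_quiz / teacher_answer.
pairs = [("Good morning.", "早上好。"), ("Thank you very much.", "非常感谢。")]
jsonl_text = to_jsonl(pairs)
print(jsonl_text)
```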

Fine-tuning with Replicate

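In the demo we use only 400 examples and fine-tune for 10 epochs through the Replicate API. Even with this limited data, llama2's English-to-Chinese translation improves on all three bertscore metrics: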

bertscore-p 0.419 -> 0.445 (+6.2%)

bertscore-r 0.592 -> 0.611 (+3.2%)

bertscore-f 0.489 -> 0.514 (+5.1%)
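
The percentages in parentheses are relative gains over the pre-fine-tuning score. A short sketch that recomputes them from the rounded scores listed above (the helper name relative_gain is ours, not part of YiVal):

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement in percent, rounded to one decimal place."""
    return round((after - before) / before * 100, 1)


# (before, after) scores from the Replicate fine-tuning run above.
replicate_scores = {
    "bertscore-p": (0.419, 0.445),
    "bertscore-r": (0.592, 0.611),
    "bertscore-f": (0.489, 0.514),
}

gains = {name: relative_gain(b, a) for name, (b, a) in replicate_scores.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain}%")
```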

Local fine-tuning results

We also ran a more complete fine-tuning locally with the EMNLP 2020 translation dataset. On the same 400 test samples, the results are better:

bertscore-p 0.516 -> 0.835 (+61.8%)

bertscore-r 0.653 -> 0.828 (+27.0%)

bertscore-f 0.575 -> 0.831 (+44.5%)

In addition, after this fine-tuning our new llama2 model reaches performance comparable to gpt-3.5.

Conclusion

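Fine-tuning improved llama2's English-to-Chinese translation ability. With the YiVal platform and its fine-tuning features, llama2 can be trained on a small dataset generated by GPT-4. Measured with the bertscore evaluator, the llama2 model fine-tuned through Replicate improved by 6.2% on bertscore-p, 3.2% on bertscore-r, and 5.1% on bertscore-f. This shows that fine-tuning an LLM for a specific task is effective even with limited data.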