Fine-tuning ChatGLM with P-Tuning v2: A Hands-on Walkthrough

Environment Setup

!pip install rouge_chinese nltk jieba datasets
Looking in indexes: https://mirrors.aliyun.com/pypi/simple
Requirement already satisfied: rouge_chinese in ./miniconda3/lib/python3.10/site-packages (1.0.3)

Deploying the ChatGLM Framework

!git clone https://gitee.com/952202/ChatGLM2-6B.git
#git clone https://github.com/THUDM/ChatGLM2-6B.git
fatal: destination path 'ChatGLM2-6B' already exists and is not an empty directory.
  • Note: the requirements pin transformers==4.30.2, but that version cannot recognize Qwen2 models. If you are sure your model is a Qwen1 version, the changes below are not needed.

  • First run pip install transformers==4.37.0, or change the transformers version in requirements.txt to 4.37.0.

  • Run vim $CONDA_PREFIX/lib/python3.10/site-packages/transformers/modeling_utils.py

and change the default value of safe_serialization to False in save_pretrained:

def save_pretrained(
        self,
        save_directory: Union[str, os.PathLike],
        is_main_process: bool = True,
        state_dict: Optional[dict] = None,
        save_function: Callable = torch.save,
        push_to_hub: bool = False,
        max_shard_size: Union[int, str] = "5GB",
        safe_serialization: bool = False,  # changed from the default True
        variant: Optional[str] = None,
        token: Optional[Union[str, bool]] = None,
        save_peft_format: bool = True,
        **kwargs,
    ):
    ...
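
If you would rather not type the path by hand, a small convenience sketch to locate the installed modeling_utils.py (equivalent to the vim path above) is:

# Print the path of the modeling_utils.py belonging to the active environment,
# so it can be opened in any editor to change the safe_serialization default.
import os
import transformers

print(os.path.join(os.path.dirname(transformers.__file__), "modeling_utils.py"))
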
%cd ChatGLM2-6B
!pip install -r requirements.txt
[Errno 2] No such file or directory: 'ChatGLM2-6B'
/data/ChatGLM2-6B


/data/miniconda3/lib/python3.10/site-packages/IPython/core/magics/osm.py:393: UserWarning: using bookmarks requires you to install the `pickleshare` library.
  bkms = self.shell.db.get('bookmarks', {})


Looking in indexes: https://mirrors.aliyun.com/pypi/simple
Requirement already satisfied: protobuf in /data/miniconda3/lib/python3.10/site-packages (from -r requirements.txt (line 1)) (3.20.3)
Collecting transformers==4.30.2 (from -r requirements.txt (line 2))
  Downloading https://mirrors.aliyun.com/pypi/packages/5b/0b/e45d26ccd28568013523e04f325432ea88a442b4e3020b757cf4361f0120/transformers-4.30.2-py3-none-any.whl (7.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.2/7.2 MB 682.4 kB/s eta 0:00:00
Requirement already satisfied: cpm_kernels in /data/miniconda3/lib/python3.10/site-packages (from -r requirements.txt (line 3)) (1.0.11)
Requirement already satisfied: torch>=2.0 in /data/miniconda3/lib/python3.10/site-packages (from -r requirements.txt (line 4)) (2.3.0)
Requirement already satisfied: gradio in /data/miniconda3/lib/python3.10/site-packages (from -r requirements.txt (line 5)) (4.26.0)
Requirement already satisfied: mdtex2html in /data/miniconda3/lib/python3.10/site-packages (from -r requirements.txt (line 6)) (1.3.0)
Requirement already satisfied: sentencepiece in /data/miniconda3/lib/python3.10/site-packages (from -r requirements.txt (line 7)) (0.2.0)
......
Successfully installed tokenizers-0.13.3 transformers-4.30.2
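
Note that pip install -r requirements.txt re-pinned transformers to 4.30.2 (see the log above). If you took the 4.37.0 route described earlier, confirm which version actually ended up in the environment, for example:

# Quick check of the transformers version actually installed in the active environment.
import transformers
print(transformers.__version__)  # 4.37.0 if requirements.txt was edited as described, otherwise 4.30.2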

Preparing the Training Dataset

import json

# Instruction prepended to every sentence (in Chinese, matching the data):
# "Classify the sentence as positive, negative, or neutral"
instruction="请对句子进行评价分类(正向评价,负向评价,中性评价): "
# Read the JSON file
with open('/data/train_data/训练集示例.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Prepend the instruction string to the content field of every record
for item in data:
    item['content'] = instruction + item['content']

# Write the updated data back to a JSON file
with open('/data/train_data/formatted_train_data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)
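
For reference, a record in formatted_train_data.json then looks like the following (the example sentence is the one shown in the tokenizer log further below; content and summary are the field names passed to --prompt_column and --response_column during training):

# One illustrative record after formatting; the instruction prefix is part of content.
sample_record = {
    "content": "请对句子进行评价分类(正向评价,负向评价,中性评价): 一百多和三十的也看不出什么区别,包装精美,质量应该不错。",
    "summary": "正向评价"
}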

Model Training

%cd ptuning
!pwd
/data/ChatGLM2-6B/ptuning
/data/ChatGLM2-6B/ptuning
# Disable wandb logging
import os
os.environ["WANDB_DISABLED"] = "true"
  • --max_steps 1000: number of training steps. Estimate it from the size of the training set; for the equivalent of epochs=3 it should be roughly (total training records × 3) divided by the effective batch size (per_device_train_batch_size × gradient_accumulation_steps).
  • --save_steps 100: checkpoint interval; choose a value that divides max_steps evenly, so a checkpoint is saved at the final step.
  • --model_name_or_path: path to the base model.
  • --output_dir: output directory of the training run; needed later for inference.
  • --learning_rate: for other models I usually set 1e-4, but GLM may behave differently; the default here is 2e-2.
  • --prompt_column content: name of the input field in the training set.
  • --response_column summary: name of the target field in the training set.
!CUDA_VISIBLE_DEVICES=1 torchrun --standalone --nnodes=1 --nproc-per-node=1 main.py \
    --do_train \
    --train_file /data/train_data/formatted_train_data.json \
    --preprocessing_num_workers 10 \
    --prompt_column content \
    --response_column summary \
    --overwrite_cache \
    --model_name_or_path /models/chatglm2-6b \
    --output_dir output/adgen-chatglm2-6b-pt-128-2e-2 \
    --overwrite_output_dir \
    --max_source_length 64 \
    --max_target_length 128 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --predict_with_generate \
    --max_steps 1000 \
    --logging_steps 500 \
    --save_steps 100 \
    --learning_rate 2e-2 \
    --pre_seq_len 128 \
    --quantization_bit 4
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
08/02/2024 05:23:17 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
08/02/2024 05:23:17 - INFO - __main__ - Training/evaluation parameters Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_config=None,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=16,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.02,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=output/adgen-chatglm2-6b-pt-128-2e-2/runs/Aug02_05-23-17_gpuserver01,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=500,
logging_strategy=steps,
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=1000,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_hf,
optim_args=None,
output_dir=output/adgen-chatglm2-6b-pt-128-2e-2,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=1,
predict_with_generate=True,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=output/adgen-chatglm2-6b-pt-128-2e-2,
save_on_each_node=False,
save_safetensors=False,
save_steps=100,
save_strategy=steps,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
sortish_sampler=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)

num_proc must be <= 2. Reducing num_proc to 2 for dataset of size 2.
08/02/2024 05:25:17 - WARNING - datasets.arrow_dataset - num_proc must be <= 2. Reducing num_proc to 2 for dataset of size 2.
Running tokenizer on train dataset (num_proc=2): 100%|█| 2/2 [00:00<00:00,  4.74
input_ids [64790, 64792, 790, 30951, 517, 30910, 30939, 30996, 13, 13, 54761, 31211, 55073, 54570, 39099, 31636, 32581, 33328, 30946, 46163, 32581, 30932, 55176, 54759, 32581, 30932, 51102, 32581, 1528, 34346, 54999, 54573, 54542, 33847, 54530, 54574, 54652, 34204, 31642, 34213, 31123, 33969, 42524, 31123, 31838, 31876, 32884, 31155, 13, 13, 55437, 31211, 30910, 46163, 32581, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
inputs [Round 1]

问:请对句子进行评价分类(正向评价,负向评价,中性评价): 一百多和三十的也看不出什么区别,包装精美,质量应该不错。

答: 正向评价
label_ids [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 30910, 46163, 32581, 2, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]
labels 正向评价
[INFO|trainer.py:577] 2024-08-02 05:25:19,287 >> max_steps is given, it will override any value given in num_train_epochs
/data/miniconda3/lib/python3.10/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
[INFO|trainer.py:1786] 2024-08-02 05:25:19,912 >> ***** Running training *****
[INFO|trainer.py:1787] 2024-08-02 05:25:19,912 >>   Num examples = 2
[INFO|trainer.py:1788] 2024-08-02 05:25:19,912 >>   Num Epochs = 1,000
[INFO|trainer.py:1789] 2024-08-02 05:25:19,912 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:1790] 2024-08-02 05:25:19,912 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1791] 2024-08-02 05:25:19,912 >>   Gradient Accumulation steps = 16
[INFO|trainer.py:1792] 2024-08-02 05:25:19,912 >>   Total optimization steps = 1,000
[INFO|trainer.py:1793] 2024-08-02 05:25:19,913 >>   Number of trainable parameters = 1,835,008
  0%|                                                  | 0/1000 [00:00<?, ?it/s]08/02/2024 05:25:20 - WARNING - transformers_modules.chatglm2-6b.modeling_chatglm - `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
/data/miniconda3/lib/python3.10/site-packages/torch/utils/checkpoint.py:464: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
 10%|████                                    | 100/1000 [00:37<05:33,  2.70it/s]Saving PrefixEncoder
{'loss': 0.0001, 'learning_rate': 0.0, 'epoch': 1000.0}                         
100%|███████████████████████████████████████| 1000/1000 [06:01<00:00,  2.82it/s]Saving PrefixEncoder
[INFO|configuration_utils.py:458] 2024-08-02 05:31:21,906 >> Configuration saved in output/adgen-chatglm2-6b-pt-128-2e-2/checkpoint-1000/config.json
[INFO|configuration_utils.py:364] 2024-08-02 05:31:21,907 >> Configuration saved in output/adgen-chatglm2-6b-pt-128-2e-2/checkpoint-1000/generation_config.json
[INFO|modeling_utils.py:1853] 2024-08-02 05:31:21,915 >> Model weights saved in output/adgen-chatglm2-6b-pt-128-2e-2/checkpoint-1000/pytorch_model.bin
[INFO|tokenization_utils_base.py:2194] 2024-08-02 05:31:21,916 >> tokenizer config file saved in output/adgen-chatglm2-6b-pt-128-2e-2/checkpoint-1000/tokenizer_config.json
[INFO|tokenization_utils_base.py:2201] 2024-08-02 05:31:21,916 >> Special tokens file saved in output/adgen-chatglm2-6b-pt-128-2e-2/checkpoint-1000/special_tokens_map.json
[INFO|trainer.py:2053] 2024-08-02 05:31:21,955 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)


{'train_runtime': 362.0417, 'train_samples_per_second': 44.194, 'train_steps_per_second': 2.762, 'train_loss': 0.015981700539588927, 'epoch': 1000.0}
100%|███████████████████████████████████████| 1000/1000 [06:02<00:00,  2.76it/s]
***** train metrics *****
  epoch                    =     1000.0
  train_loss               =      0.016
  train_runtime            = 0:06:02.04
  train_samples            =          2
  train_samples_per_second =     44.194
  train_steps_per_second   =      2.762
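
Only the PrefixEncoder is trained and saved here (1,835,008 trainable parameters per the log above), so the checkpoint is tiny compared with the 6B base model. An optional sanity check of what the checkpoint actually contains:

# Optional: inspect the P-Tuning v2 checkpoint; it should hold only prefix-encoder weights.
import torch

ckpt = torch.load("output/adgen-chatglm2-6b-pt-128-2e-2/checkpoint-1000/pytorch_model.bin",
                  map_location="cpu")
for name, tensor in ckpt.items():
    print(name, tuple(tensor.shape))
# Expect essentially one entry, transformer.prefix_encoder.embedding.weight, of shape
# (128, 14336): 128 * 14336 = 1,835,008, matching the trainable-parameter count above.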

Model Inference

from transformers import AutoConfig, AutoModel, AutoTokenizer
import torch

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("/models/chatglm2-6b", trust_remote_code=True)
# Load the config with pre_seq_len set so the model is built with a prefix encoder
config = AutoConfig.from_pretrained("/models/chatglm2-6b", trust_remote_code=True, pre_seq_len=128)
model = AutoModel.from_pretrained("/models/chatglm2-6b", config=config, trust_remote_code=True).half().cuda()
Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]


Some weights of ChatGLMForConditionalGeneration were not initialized from the model checkpoint at /models/chatglm2-6b and are newly initialized: ['transformer.prefix_encoder.embedding.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Load the trained checkpoint: it should be the latest checkpoint-??? directory under output_dir (here checkpoint-1000).

import os  # imported earlier in the notebook; repeated so this cell also runs standalone
# Load the saved PrefixEncoder weights and strip the "transformer.prefix_encoder." prefix from the keys
prefix_state_dict = torch.load(os.path.join("output/adgen-chatglm2-6b-pt-128-2e-2/checkpoint-1000", "pytorch_model.bin"))
new_prefix_state_dict = {}
for k, v in prefix_state_dict.items():
    if k.startswith("transformer.prefix_encoder."):
        new_prefix_state_dict[k[len("transformer.prefix_encoder."):]] = v
model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict)
<All keys matched successfully>
model = model.eval()
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
你好,有什麼我可以幫忙的嗎?
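
Since training used --quantization_bit 4, inference can also be run with the model quantized to save GPU memory. The ChatGLM2-6B model code exposes a quantize() method (used in the upstream README), so an alternative, memory-saving load path looks roughly like the sketch below; treat it as an optional variant, not a required step.

# Optional, memory-saving variant of the load path above (a sketch): reload the base model,
# reuse the prefix-encoder weights prepared above, then quantize to 4 bits before moving to GPU.
model = AutoModel.from_pretrained("/models/chatglm2-6b", config=config, trust_remote_code=True)
model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict)
model = model.quantize(4)
model = model.cuda()
model = model.eval()
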
import json

# Read the test set
with open('/data/train_data/测试集示例.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# For every record, call the model with the instruction-prefixed content and store the prediction in summary
for item in data:
    content = instruction + item['content']
    response, history = model.chat(tokenizer, content, history=[])
    item['summary'] = response

# Write the predictions to a new JSON file
with open('/data/train_data/output.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)
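
If the test set carries gold labels in its summary field, the rouge_chinese and jieba packages installed at the start can give a rough quality score for the predictions. A sketch under that assumption (adjust paths and field names to your own data):

# Compare predictions in output.json with the original test-set labels (assumed to be the
# 'summary' field of 测试集示例.json, which is untouched on disk by the loop above).
import json
import jieba
from rouge_chinese import Rouge

with open('/data/train_data/测试集示例.json', 'r', encoding='utf-8') as f:
    references = json.load(f)   # assumed to hold gold labels in 'summary'
with open('/data/train_data/output.json', 'r', encoding='utf-8') as f:
    predictions = json.load(f)  # model outputs written above

# Exact-match accuracy for the three-way classification task
acc = sum(p['summary'] == r['summary'] for p, r in zip(predictions, references)) / len(references)
print(f"exact-match accuracy: {acc:.2%}")

# Rouge-L over jieba-tokenized text, the metric the ptuning evaluation script also reports
rouge = Rouge()
hyps = [' '.join(jieba.cut(p['summary'])) for p in predictions]
refs = [' '.join(jieba.cut(r['summary'])) for r in references]
print(rouge.get_scores(hyps, refs, avg=True)['rouge-l'])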