Training on Colab's free GPU
1. Install the required libraries
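!pip install auto-gptq # GPTQ quantization support, needed to load GPTQ-quantized models
!pip install optimum # Hugging Face optimization toolkit; transformers relies on it for the GPTQ integration
!pip install bitsandbytes # 8-bit optimizers and low-bit quantization (used later for paged_adamw_8bit)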
!pip install torch==2.1
#### Import transformers, the lightweight fine-tuning package peft, datasets, and related packages
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import prepare_model_for_kbit_training
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
import transformers
2. Load the pretrained model from Hugging Face (which hosts a large number of open-source pretrained NLP models)
model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",  # automatically figures out how to best use CPU + GPU for loading model
    trust_remote_code=False,  # prevents running custom model files on your machine
    revision="main",  # which version of model to use in repo
)
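The pretrained model is a large download, and in regions where Hugging Face is not directly reachable you will need a proxy or VPN to fetch it.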
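To get a feel for why a quantized checkpoint is used on a free Colab GPU, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone. The parameter count (about 7.24 billion) and the 4-bit width are assumptions for illustration; the real file is somewhat larger because quantization scales and any non-quantized layers are stored as well.

```python
# Rough weight-memory estimate; both constants are assumptions for illustration.
N_PARAMS = 7.24e9  # approximate parameter count of a 7B-class model (assumed)


def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


fp16_gb = weight_memory_gb(N_PARAMS, 16)  # half-precision weights
int4_gb = weight_memory_gb(N_PARAMS, 4)   # 4-bit quantized weights (assumed width)

print(f"fp16 weights:  ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{int4_gb:.1f} GB")
```

Activations, the KV cache, and optimizer state come on top of this, so the estimate is a lower bound rather than the full GPU footprint.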
3. Load the tokenizer and test the base pretrained model
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model.eval() # evaluation mode (dropout is disabled)
# viewer comment
comment = "你写的内容对我很有帮助,谢谢!" # "What you wrote was very helpful to me, thank you!"
prompt = f'''[INST] {comment} [/INST]'''
# tokenize input
inputs = tokenizer(prompt, return_tensors="pt")
# generate output (length is capped by max_new_tokens)
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=140)
print(tokenizer.batch_decode(outputs)[0])
The tokenizer is far smaller than the pretrained model.
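For a decoder-only model, generate returns the prompt tokens followed by the newly generated ones, so the text printed above repeats the prompt. A minimal sketch of keeping only the reply is shown here with toy token ids (the id values are hypothetical stand-ins, not real tokenizer output):

```python
def strip_prompt(output_ids, prompt_length):
    """Return only the newly generated token ids (everything after the prompt)."""
    return output_ids[prompt_length:]


# Toy token ids standing in for real tokenizer output (hypothetical values).
prompt_ids = [1, 733, 16289, 28793]
generated = prompt_ids + [5284, 368, 2]  # prompt followed by three new tokens

new_tokens = strip_prompt(generated, len(prompt_ids))
print(new_tokens)
```

With the real tensors from this section the same idea would read outputs[0][inputs["input_ids"].shape[1]:], which can then be passed to tokenizer.decode with skip_special_tokens=True.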
4. Set up the prompt
##### The prompt can be written in Chinese or English, but English may be faster and give better results.
intstructions_string = f"""
TikaGPT, functioning as a virtual data science consultant on social media, communicates in clear, accessible language, escalating to technical depth upon request. \
It reacts to feedback aptly and ends responses with its signature '–TikaGPT'. \
TikaGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, \
thus keeping the interaction natural and engaging.
Please translate all the final replies into appropriate Chinese
And please respond to the following comment.
"""
### lambda that joins the instructions and the comment into one prompt
prompt_template = lambda comment: f'''[INST] {intstructions_string} \n{comment}\n[/INST]'''
prompt = prompt_template(comment)
print(prompt)
## Evaluation
# tokenize input
inputs = tokenizer(prompt, return_tensors="pt")
# generate output
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=140)
print(tokenizer.batch_decode(outputs)[0])
The prompt is fully customizable: write your own and use it for testing and training.
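When experimenting with several prompt variants, the lambda above can also be written as a small named function that takes the instructions as an argument. This is only a sketch: build_prompt and the shortened demo instructions are illustrative names, not part of the original notebook, and the [INST] ... [/INST] layout simply mirrors the template used above.

```python
def build_prompt(instructions: str, comment: str) -> str:
    """Wrap instructions and a viewer comment in [INST] ... [/INST] tags."""
    return f"[INST] {instructions} \n{comment}\n[/INST]"


# Shortened stand-in for the full instruction string (hypothetical text).
demo_instructions = "Reply briefly and keep a friendly tone."
demo_prompt = build_prompt(demo_instructions, "Great content, thank you!")
print(demo_prompt)
```

Passing the instructions explicitly makes it easy to compare two instruction strings on the same comment without redefining a global variable.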
5. Prepare the model for training: use quantization to reduce the training burden (QLoRA)
model.train() # model in training mode (dropout modules are activated)
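# enable gradient checkpointing (recompute activations in the backward pass to save memory)
model.gradient_checkpointing_enable()
# prepare the quantized model for k-bit training
model = prepare_model_for_kbit_training(model)
# LoRA configuration
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)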
# wrap the model with LoRA adapters (peft, lightweight fine-tuning)
model = get_peft_model(model, config)
# print the number of trainable parameters
model.print_trainable_parameters()
As the output shows, the peft model has far fewer trainable parameters than the original model.
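The size of that reduction can be worked out by hand. LoRA adds two small matrices per targeted layer, A with shape (r, d_in) and B with shape (d_out, r), and only those are trained. The sketch below assumes the usual Mistral-7B shapes (q_proj mapping 4096 to 4096, 32 decoder layers); these shapes are assumptions stated here rather than values read from the notebook.

```python
def lora_param_count(d_in, d_out, r, n_layers, modules_per_layer=1):
    """Trainable parameters added by LoRA: A is (r x d_in), B is (d_out x r)."""
    per_module = r * d_in + d_out * r
    return per_module * modules_per_layer * n_layers


# Assumed Mistral-7B shapes: q_proj maps 4096 -> 4096, 32 decoder layers.
trainable = lora_param_count(d_in=4096, d_out=4096, r=8, n_layers=32)
print(f"{trainable:,} trainable LoRA parameters")
```

Against roughly 7 billion base weights this is a small fraction of a percent. Adding more entries to target_modules (for example the other attention projections) raises the count proportionally through modules_per_layer when their shapes match.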
6. Load and preprocess the data (this step is fairly standard)
# load dataset
data = load_dataset("shawhin/shawgpt-youtube-comments")
# create tokenize function
def tokenize_function(examples):
    # extract text
    text = examples["example"]
    # tokenize and truncate text (left truncation keeps the end of over-long examples)
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=512
    )
    return tokenized_inputs
# tokenize training and validation datasets
tokenized_data = data.map(tokenize_function, batched=True)
# setting pad token
tokenizer.pad_token = tokenizer.eos_token
# data collator
data_collator = transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
The dataset used for fine-tuning is fairly small.
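Two details of this preprocessing are easy to miss: left truncation keeps the end of an over-long example, and the language-modeling collator (with mlm=False) pads each batch and copies the inputs into the labels, replacing pad positions with -100 so the loss ignores them. The pure-Python sketch below imitates both on toy token ids. It is a simplification under stated assumptions: right padding, no special-token handling, and helper names that are illustrative only.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def truncate_left(token_ids, max_length):
    """Keep only the last max_length tokens, mimicking truncation_side="left"."""
    return token_ids[max(0, len(token_ids) - max_length):]


def collate_causal_lm(batch, pad_token_id):
    """Pad to the longest sequence and build labels for next-token prediction."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = longest - len(ids)
        padded = ids + [pad_token_id] * n_pad  # right padding (assumed)
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels copy the inputs; every token equal to the pad id is ignored.
        labels.append([IGNORE_INDEX if t == pad_token_id else t for t in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


batch = [truncate_left([5, 6, 7, 8, 9], max_length=3), [10, 11]]
out = collate_causal_lm(batch, pad_token_id=2)
print(out)
```

Because the pad token is set to the EOS token in this walkthrough, any genuine EOS token in an example is masked out of the labels in the same way, which is worth keeping in mind if the model later struggles to stop generating.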
7. Fine-tuning the model: hyperparameters and training arguments
# hyperparameters
lr = 2e-4
batch_size = 4
num_epochs = 10
# define training arguments
training_args = transformers.TrainingArguments(
    output_dir="Tika-ft",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    fp16=True,
    optim="paged_adamw_8bit",
)
All of these parameters can be tuned according to how well training goes.
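When tuning these values it helps to know how many optimizer updates they actually produce. With a per-device batch size of 4 and 4 gradient-accumulation steps, each update sees 16 examples. The sketch below estimates the total number of updates on a single device; the training-set size of 50 is a hypothetical figure, and the floor division for a trailing partial accumulation group is an assumption, since the exact rounding can differ between Trainer versions.

```python
import math


def training_steps(n_examples, batch_size, grad_accum, epochs):
    """Estimate effective batch size and total optimizer steps on one device."""
    effective_batch = batch_size * grad_accum
    batches_per_epoch = math.ceil(n_examples / batch_size)
    # Assumption: a trailing partial accumulation group is dropped (floor division).
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return effective_batch, steps_per_epoch * epochs


# n_examples=50 is a hypothetical training-set size used only for illustration.
effective_batch, total_steps = training_steps(50, batch_size=4, grad_accum=4, epochs=10)
print(f"effective batch size: {effective_batch}, optimizer steps: {total_steps}")
```

With so few updates, warmup_steps=2 already covers a noticeable share of training, so changing the batch size or accumulation steps also changes how much of the run happens at the full learning rate.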
8. Retrain (fine-tune) the model
# configure trainer
trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    args=training_args,
    data_collator=data_collator
)
# train model
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
trainer.train()
# re-enable the cache for inference
model.config.use_cache = True
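Both the training loss and the validation loss are decreasing. Training ran for only 30 steps (10 epochs, which took about 10 minutes), so if resources allow it is worth training for longer.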
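The loss reported by the trainer is the mean per-token cross-entropy, which can be hard to interpret on its own. One common way to read it is to convert it to perplexity, the exponential of the loss. The loss values below are hypothetical and are not results from this run.

```python
import math


def perplexity(cross_entropy_loss):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(cross_entropy_loss)


# Hypothetical loss values for illustration only, not results from this run.
for loss in (4.0, 2.0, 1.0):
    print(f"loss={loss:.1f} -> perplexity={perplexity(loss):.1f}")
```

If the validation value stops falling while the training value keeps dropping, that is the usual sign that further epochs are starting to overfit such a small dataset.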
9. Save the model to Hugging Face and use the saved model in practice
from huggingface_hub import notebook_login
notebook_login()
hf_name = 'Xiaodong1' # your hf username or org name
model_id = hf_name + "/" + "Tika-ft"
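The lines above log in and build model_id, but the upload call itself is not shown. A minimal sketch of that step follows, assuming the push_to_hub method that peft models expose and a successful notebook_login() beforehand. The helper names and the recording stand-in class are illustrative only, so the example can run offline without touching the Hub.

```python
def build_repo_id(hf_name: str, repo_name: str) -> str:
    """Join a user or org name and a repository name into a Hub repo id."""
    return f"{hf_name}/{repo_name}"


def upload_adapter(peft_model, repo_id: str) -> None:
    """Upload the adapter weights; assumes the object exposes push_to_hub(repo_id)."""
    peft_model.push_to_hub(repo_id)


class RecordingModel:
    """Offline stand-in that only records the call; not a real model."""

    def __init__(self):
        self.pushed = []

    def push_to_hub(self, repo_id):
        self.pushed.append(repo_id)


demo_model = RecordingModel()
upload_adapter(demo_model, build_repo_id("your-username", "Tika-ft"))
print(demo_model.pushed)
```

In the notebook the equivalent call would be upload_adapter(model, model_id), where model is the peft model trained above. Only the small LoRA adapter is uploaded this way, not the full base model.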
## Load the fine-tuned model
# load model from hub
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM
model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=False,
    revision="main",
)
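# this loads the public "Shawhin/shawgpt-ft" adapter; pass your own model_id instead once your adapter has been uploaded
config = PeftConfig.from_pretrained("Shawhin/shawgpt-ft")
model = PeftModel.from_pretrained(model, "Shawhin/shawgpt-ft")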
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
## Usage:
comment = "What is fat-tailedness?"
prompt = prompt_template(comment)
model.eval()
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=280)
print(tokenizer.batch_decode(outputs)[0])
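Saving the model to Hugging Face makes it easy to load and reuse.
For more on QLoRA and RAG, see this other article:
Fine-Tuning Large Language Models (LLMs) | by Shawhin Talebi | Towards Data Science (towardsdatascience.com)
(It includes a detailed video walkthrough and written documentation, and is a very good resource.)
This article is based on: GitHub: https://github.com/ShawhinT/YouTube-Blog/tree/main/LLMs/qlora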