How to Fine-tune an Embedding Model for LLM Applications?

This article shows how to use Sentence Transformers to fine-tune the open-source embedding model bge-base-zh-v1.5, and verifies the effect of the fine-tuned embedding model.

Embedding models are something we work with constantly in RAG pipelines and semantic similarity tasks.

Sentence Transformers is a Python library for using and training embedding models for a wide range of applications, such as retrieval-augmented generation (RAG), semantic search, semantic textual similarity, paraphrase mining, and more. Its 3.0 release is the largest update since the project was created and introduces a new training approach.

Using the embedding model bge-base-zh-v1.5, open-sourced by the Beijing Academy of Artificial Intelligence (BAAI), as the base model, this article shows how to evaluate it with Sentence Transformers, fine-tune it, and verify that the fine-tuned model performs better.

Baseline Evaluation Metrics

In a previous article, we used the LlamaIndex framework to evaluate various retrieval algorithms in the RAG pipeline, including embedding-model recall, with Hit Rate and MRR as the evaluation metrics. This article continues to use the dataset from that article for evaluation.
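As a refresher, Hit Rate is the fraction of queries whose relevant document appears in the top-k retrieved results, and MRR is the mean reciprocal rank of the first relevant document. The toy sketch below (a hypothetical helper with made-up data, not part of the evaluation script) shows how the two metrics are computed:

# Toy illustration of Hit Rate and MRR (hypothetical helper, not from the evaluation script)
def hit_rate_and_mrr(ranked_ids_per_query, relevant_id_per_query, k=10):
    hits, reciprocal_ranks = 0, []
    for qid, ranked_ids in ranked_ids_per_query.items():
        top_k = ranked_ids[:k]
        relevant_id = relevant_id_per_query[qid]
        if relevant_id in top_k:
            hits += 1
            reciprocal_ranks.append(1.0 / (top_k.index(relevant_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    n = len(ranked_ids_per_query)
    return hits / n, sum(reciprocal_ranks) / n

# toy data: two queries, each with a ranked list of retrieved document ids
ranked = {"q1": ["d3", "d1", "d7"], "q2": ["d5", "d2", "d9"]}
relevant = {"q1": "d1", "q2": "d9"}
print(hit_rate_and_mrr(ranked, relevant, k=3))  # -> (1.0, ~0.4167): ranks 2 and 3 give (1/2 + 1/3) / 2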

An example evaluation script is shown below:

# -*- coding: utf-8 -*-
# @file: bge_base_zh_eval.py
import os
import json
import time
import torch
from pprint import pprint
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers.util import cos_sim

project_dir = os.path.dirname(os.path.abspath(__file__)).split('/src')[0]

# data process
# load dataset, get corpus, queries, relevant_docs
with open(os.path.join(project_dir, "data/doc_qa.json"), "r", encoding="utf-8") as f:
    content = json.loads(f.read())

corpus = content['corpus']
queries = content['queries']
relevant_docs = content['relevant_docs']

# Load a model
# replace with your own full model path, or use a Hugging Face model id
model_name = "bge-base-zh-v1.5"
model_path = os.path.join(project_dir, f"models/{model_name}")
model = SentenceTransformer(model_path, device="cuda" if torch.cuda.is_available() else "cpu")
print("Model loaded")

s_time = time.time()

# Build the evaluator
evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name=f"{os.path.basename(model_path)}",
    score_functions={"cosine": cos_sim}
)

# Evaluate the model
result = evaluator(model)
pprint(result)
print(f"Time cost: {time.time() - s_time:.2f}s")

We pass the queries, corpus, and relevant_docs dictionaries to the evaluator; once the model is loaded, the evaluation can be run.
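For reference, doc_qa.json is assumed to contain the three dictionaries that InformationRetrievalEvaluator expects: queries and corpus map IDs to text, and relevant_docs maps each query ID to a list of relevant corpus IDs. A hypothetical illustration of the layout:

# hypothetical layout of data/doc_qa.json (IDs and text made up for illustration)
doc_qa_example = {
    "queries": {"q1": "What is retrieval-augmented generation (RAG)?"},          # query id -> question text
    "corpus": {"node_1": "Retrieval-augmented generation (RAG) combines ..."},   # node id -> passage text
    "relevant_docs": {"q1": ["node_1"]},                                          # query id -> relevant node ids
}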

The evaluation results are reported later in the article and serve as the baseline metrics.

Synthesizing Fine-tuning Data

The LlamaIndex framework provides the generate_qa_embedding_pairs method, which uses prompt engineering to generate questions from text chunks and associate each question with its source passage.

The data synthesis script for fine-tuning the embedding model is as follows:

# -*- coding: utf-8 -*-
# @file: make_ft_corpus.py
import os
from llama_index.legacy.finetuning import (
    generate_qa_embedding_pairs
)
from llama_index.llms.openai import OpenAI
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from dotenv import load_dotenv

load_dotenv()

project_dir = os.path.dirname(os.path.abspath(__file__)).split('/src')[0]

TRAIN_FILES = [os.path.join(project_dir, "data/ft_train.txt")]
VAL_FILES = [os.path.join(project_dir, "data/ft_test.txt")]

TRAIN_CORPUS_FPATH = os.path.join(project_dir, "data/ft_train_corpus.json")
VAL_CORPUS_FPATH = os.path.join(project_dir, "data/ft_val_corpus.json")


def load_corpus(files, verbose=False):
    if verbose:
        print(f"Loading files {files}")

    reader = SimpleDirectoryReader(input_files=files)
    docs = reader.load_data()
    if verbose:
        print(f"Loaded {len(docs)} docs")

    parser = SentenceSplitter(chunk_size=250, chunk_overlap=0)
    nodes = parser.get_nodes_from_documents(docs, show_progress=verbose)

    if verbose:
        print(f"Parsed {len(nodes)} nodes")

    return nodes


train_nodes = load_corpus(TRAIN_FILES, verbose=True)
val_nodes = load_corpus(VAL_FILES, verbose=True)

llm = OpenAI(model="gpt-3.5-turbo", api_key=os.getenv("OPENAI_API_KEY"))

qa_generate_prompt_tmpl = """\
Context information is below.

---------------------
{context_str}
---------------------

Given the context information and not prior knowledge.
generate only questions based on the below query.

You are a Professor. Your task is to setup \
{num_questions_per_chunk} questions for an upcoming \
quiz/examination in Chinese. The questions should be diverse in nature \
across the document in Chinese. The questions should not contain options, not start with Q1/ Q2. \
Restrict the questions to the context information provided.
"""

train_dataset = generate_qa_embedding_pairs(nodes=train_nodes, llm=llm, num_questions_per_chunk=1, qa_generate_prompt_tmpl=qa_generate_prompt_tmpl)
val_dataset = generate_qa_embedding_pairs(nodes=val_nodes, llm=llm, num_questions_per_chunk=1, qa_generate_prompt_tmpl=qa_generate_prompt_tmpl)

train_dataset.save_json(TRAIN_CORPUS_FPATH)
val_dataset.save_json(VAL_CORPUS_FPATH)

The output is as follows:

Loading files ['/Users/admin/PycharmProjects/embedding_model_exp/data/ft_train.txt']
Loaded 1 docs
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 23.54it/s]
Parsed 137 nodes
Loading files ['/Users/admin/PycharmProjects/embedding_model_exp/data/ft_test.txt']
Loaded 1 docs
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 45.84it/s]
Parsed 111 nodes
100%|██████████| 137/137 [03:34<00:00,  1.57s/it]
100%|██████████| 111/111 [01:55<00:00,  1.04s/it]

With that, we obtain the fine-tuning datasets, saved as ft_train_corpus.json and ft_val_corpus.json.
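To sanity-check the generated data, you can load one of the saved JSON files and print a (query, positive passage) pair. A minimal sketch, assuming the queries / corpus / relevant_docs layout described above:

import json

# quick sanity check on the generated fine-tuning data
with open("data/ft_train_corpus.json", "r", encoding="utf-8") as f:
    data = json.load(f)

query_id, node_ids = next(iter(data["relevant_docs"].items()))
print("query   :", data["queries"][query_id])
print("positive:", data["corpus"][node_ids[0]][:100], "...")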

Fine-tuning the Embedding Model

Next, we fine-tune the bge-base-zh-v1.5 model. The goal of fine-tuning is to adapt the model to our own dataset and thereby achieve better recall.

Using `sentence-transformers v3`

Here we use version 3.0.0 of the sentence-transformers package (installable with `pip install sentence-transformers==3.0.0`).

With this package, fine-tuning an embedding model is straightforward. The fine-tuning code is as follows:

# -*- coding: utf-8 -*-
# @file: ft_sentence_transformers_trainer.py
import os
import json
import time
import torch
from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers.util import cos_sim
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers
from sentence_transformers import SentenceTransformerTrainer

start_time = time.time()
project_dir = os.path.dirname(os.path.abspath(__file__)).split('/src')[0]

# load eval dataset
with open(os.path.join(project_dir, "data/ft_val_dataset.json"), "r", encoding="utf-8") as f:
    eval_content = json.loads(f.read())

corpus, queries, relevant_docs = eval_content['corpus'], eval_content['queries'], eval_content['relevant_docs']
# load train dataset
with open(os.path.join(project_dir, "data/ft_train_dataset.json"), "r", encoding="utf-8") as f:
    train_content = json.loads(f.read())

train_anchor, train_positive = [], []
for query_id, context_id in train_content['relevant_docs'].items():
    train_anchor.append(train_content['queries'][query_id])
    train_positive.append(train_content['corpus'][context_id[0]])

train_dataset = Dataset.from_dict({"positive": train_positive, "anchor": train_anchor})

print(train_dataset)
print(train_dataset[0:5])

# Load a model
model_name = 'bge-base-zh-v1.5'
# replace with your own full model path, or use a Hugging Face model id
model_path = os.path.join(project_dir, f"models/{model_name}")
model = SentenceTransformer(model_path, device="cuda:0" if torch.cuda.is_available() else "cpu")
print("Model loaded")

# Build the evaluator on the validation split
evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name=f"{model_name}",
    score_functions={"cosine": cos_sim}
)
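# MultipleNegativesRankingLoss treats the other positives in a batch as negatives for each (anchor, positive) pair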
train_loss = MultipleNegativesRankingLoss(model)

# define training arguments
args = SentenceTransformerTrainingArguments(
    output_dir=f"ft_{model_name}",  # output directory and hugging face model ID
    num_train_epochs=5,  # number of epochs
    per_device_train_batch_size=2,  # train batch size
    gradient_accumulation_steps=2,  # effective batch size of 2 * 2 = 4 per device
    per_device_eval_batch_size=4,  # evaluation batch size
    warmup_ratio=0.1,  # warmup ratio
    learning_rate=2e-5,  # learning rate, 2e-5 is a good value
    lr_scheduler_type="cosine",  # use a cosine learning rate schedule
    optim="adamw_torch_fused",  # use fused adamw optimizer
    tf32=True,  # use tf32 precision
    bf16=True,  # use bf16 precision
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoid duplicate samples in a batch (they would act as false negatives)
    eval_strategy="epoch",  # evaluate after each epoch
    save_strategy="epoch",  # save after each epoch
    logging_steps=10,  # log every 10 steps
    save_total_limit=3,  # save only the last 3 models
    load_best_model_at_end=True,  # load the best model when training ends
    metric_for_best_model=f"eval_{model_name}_cosine_ndcg@10",  # Optimizing for the best ndcg@10 score
)

# train the model
trainer = SentenceTransformerTrainer(
    model=model,    # the model to train
    args=args,      # training arguments
    train_dataset=train_dataset.select_columns(
        ["positive", "anchor"]
    ),  # training dataset
    loss=train_loss,
    evaluator=evaluator
)

trainer.train()
trainer.save_model()
print(f"cost time: {time.time() - start_time:.2f}s")

The author trained on a single NVIDIA A800-SXM4-80GB GPU; training took about 63.10 seconds. The fine-tuned embedding model is saved to the output directory (ft_bge-base-zh-v1.5).
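To verify the improvement over the baseline, you can reload the saved model and re-run the same InformationRetrievalEvaluator. A minimal sketch, assuming the validation file and output directory used in the scripts above:

import json
import torch
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers.util import cos_sim

# load the same validation split used during training (path assumed from the scripts above)
with open("data/ft_val_corpus.json", "r", encoding="utf-8") as f:
    content = json.load(f)

evaluator = InformationRetrievalEvaluator(
    queries=content["queries"],
    corpus=content["corpus"],
    relevant_docs=content["relevant_docs"],
    name="ft_bge-base-zh-v1.5",
    score_functions={"cosine": cos_sim},
)

# trainer.save_model() wrote the fine-tuned model into output_dir
ft_model = SentenceTransformer("ft_bge-base-zh-v1.5", device="cuda" if torch.cuda.is_available() else "cpu")
print(evaluator(ft_model))  # compare cosine_ndcg@10, MRR, etc. against the baseline run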

Summary

This article focused on how to use Sentence Transformers to fine-tune the open-source embedding model bge-base-zh-v1.5 and how to verify the effect of the fine-tuned model.

Sentence Transformers is a treasure trove covering nearly every aspect of embedding models, and an indispensable tool for getting to know them in depth. In follow-up articles, the author will cover embedding model quantization, Matryoshka Representation Learning (MRL), and related topics.

References

  1. Training and Finetuning Embedding Models with Sentence Transformers v3: https://huggingface.co/blog/train-sentence-transformers

  2. Fine-tune Embedding models for Retrieval Augmented Generation (RAG): https://www.philschmid.de/fine-tune-embedding-model-for-rag

  3. An Overview of Matryoshka Embedding Models: https://huggingface.co/blog/zh/matryoshka

  4. Finetune Embeddings: https://docs.llamaindex.ai/en/stable/examples/finetuning/embeddings/finetune_embedding/
