A Comprehensive Guide to Fine-Tuning Embedding Models for Specific Domains

Latest recommended article published on 2025-04-14 13:26:42

Python编程杰哥


Article link: https://blog.csdn.net/xx_nm98/article/details/143493162


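Suppose you are building a question-answering system for the medical domain. You want the system to retrieve the relevant medical articles accurately whenever a user asks a question. A general-purpose embedding model, however, may struggle with the highly specialized vocabulary and subtle distinctions of medical terminology.

This is where fine-tuning comes in.

In this post, we walk through the process of fine-tuning an embedding model for a specific domain such as medicine, law, or finance. We will generate a dataset specific to your domain and use it to train the model so that it better captures the language patterns and concepts of that domain.

By the end, you will have an embedding model optimized for your domain, which enables more accurate retrieval and better results on your natural language processing tasks.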

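Understanding Embeddings

Embeddings are numerical representations of text, images, or audio that capture semantic relationships. Picture each piece of text or audio as a point in a multidimensional space, where similar words or phrases lie close together and dissimilar ones lie far apart.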
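To make the idea of "closeness" concrete, here is a minimal sketch that compares toy vectors with cosine similarity. The three-dimensional vectors are invented for illustration and do not come from any real model; real embeddings such as those of bge-base-en have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional embeddings, invented for illustration only
toy_embeddings = {
    "heart attack": [0.9, 0.1, 0.0],
    "myocardial infarction": [0.8, 0.2, 0.1],
    "tax return": [0.0, 0.1, 0.9],
}

related = cosine_similarity(
    toy_embeddings["heart attack"], toy_embeddings["myocardial infarction"]
)
unrelated = cosine_similarity(
    toy_embeddings["heart attack"], toy_embeddings["tax return"]
)
print(f"related: {related:.3f}, unrelated: {unrelated:.3f}")
```

With these toy values, the two medical phrases score much higher than the medical and financial pair, which is the behavior a good embedding model should show on real text.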


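Embeddings are important for many natural language processing tasks, for example:

Semantic similarity: measuring how similar two images or two pieces of text are.

Text classification: assigning data to categories based on the meaning of the text.

Question answering: finding the most relevant documents to answer a question.

Retrieval-augmented generation (RAG): combining an embedding model for retrieval with a language model for text generation to improve the quality and relevance of the generated text.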
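The retrieval step shared by question answering and RAG can be sketched as ranking documents by their similarity to the query. The document ids and vectors below are hypothetical placeholders; in practice both would come from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, doc_vecs, top_k=2):
    """Return the ids of the top_k documents most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:top_k]]


# Hypothetical document embeddings, invented for illustration only
doc_vecs = {
    "doc_cardiology": [0.9, 0.1, 0.1],
    "doc_dermatology": [0.1, 0.9, 0.1],
    "doc_finance": [0.1, 0.1, 0.9],
}
query_vec = [0.8, 0.3, 0.0]  # hypothetical embedding of a cardiology question

print(retrieve(query_vec, doc_vecs))
```

In a RAG system, the retrieved documents would then be passed to the language model as context.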

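Matryoshka Representation Learning

Matryoshka Representation Learning (MRL) is a technique for creating "truncatable" embedding vectors. Picture a set of Russian nesting dolls, each containing a smaller one. MRL embeds text so that the earliest dimensions (like the outermost dolls) carry the most important information, while later dimensions add finer detail. This makes it possible to use only the first part of an embedding vector when needed, reducing storage and computation costs.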
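Truncation itself is simple: keep the first few values and rescale. The sketch below assumes the truncated vector should be re-normalized to unit length so that cosine or dot-product comparisons stay meaningful; the helper name and the 8-dimensional vector are made up for illustration.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale for an all-zero prefix
    return [x / norm for x in head]


# Toy 8-dimensional vector standing in for a real 768-dimensional embedding
full = [0.6, 0.5, 0.4, 0.3, 0.2, 0.2, 0.1, 0.1]
small = truncate_and_normalize(full, 4)
print(len(small), small)
```

Storing only the shortened vector halves the storage in this toy case; whether the shortened vector still retrieves well depends on the model having been trained with a Matryoshka-style objective.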


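Bge-base-en
BAAI/bge-base-en-v1.5, developed by the Beijing Academy of Artificial Intelligence (BAAI), is a strong text embedding model. It performs well across a range of natural language processing (NLP) tasks, and the BGE models have achieved strong results on benchmarks such as MTEB and C-MTEB. bge-base-en is a good choice when compute resources are limited (as they are in my case).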

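Why fine-tune an embedding model?
Fine-tuning an embedding model for a specific domain is an important step in optimizing a retrieval-augmented generation (RAG) system. It aligns the model's notion of similarity with the context and linguistic nuances of that domain. A fine-tuned embedding model retrieves the documents most relevant to a question more reliably, which in turn improves the accuracy and relevance of the RAG system's answers.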

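Dataset formats: laying the groundwork for fine-tuning
You can fine-tune with several dataset formats.

These are the most common:

Positive pair: a pair of related sentences (for example, a question and its answer).
Triplets: (anchor, positive, negative) triplets, where the anchor is similar to the positive and dissimilar to the negative.
Pair with similarity score: a pair of sentences with a score indicating how similar they are.
Texts with classes: texts with their corresponding class labels.
In this article, we create a dataset of questions and answers to fine-tune our bge-base-en-v1.5 model.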
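As a rough illustration, the four formats could look like the records below. The column names (`anchor`, `positive`, `negative`, `sentence1`, `sentence2`, `score`, `text`, `label`) and the example sentences are assumptions chosen for this sketch, and the similarity score is made up. The helper `rows_to_positive_pairs` is hypothetical and only mirrors the question/answer layout used later in this article.

```python
# Illustrative records for the four formats; column names are assumptions
positive_pair = {
    "anchor": "What is hypertension?",
    "positive": "Hypertension is persistently elevated blood pressure.",
}
triplet = {
    "anchor": "What is hypertension?",
    "positive": "Hypertension is persistently elevated blood pressure.",
    "negative": "A contract requires offer, acceptance, and consideration.",
}
scored_pair = {
    "sentence1": "The patient has high blood pressure.",
    "sentence2": "The patient is hypertensive.",
    "score": 0.9,  # made-up similarity score in [0, 1]
}
labeled_text = {"text": "The patient is hypertensive.", "label": "cardiology"}


def rows_to_positive_pairs(rows):
    """Convert {"Question", "Answer"} rows into {"anchor", "positive"} records."""
    return [{"anchor": r["Question"], "positive": r["Answer"]} for r in rows]


rows = [
    {
        "Question": "What is hypertension?",
        "Answer": "Hypertension is persistently elevated blood pressure.",
    }
]
pairs = rows_to_positive_pairs(rows)
print(pairs[0])
```

The positive-pair format needs no explicit negatives, which is why it fits automatically generated question-answer data.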

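Loss functions: guiding the training process
Loss functions are central to training an embedding model. They measure the gap between the model's predictions and the actual labels, and that signal is used to adjust the model weights.

Different loss functions suit different dataset formats:

Triplet loss: used with (anchor, positive, negative) triplets; it encourages the model to place similar sentences closer together and dissimilar sentences farther apart.
Contrastive loss: used with positive and negative pairs; it pulls similar sentences together and pushes dissimilar ones apart.
Cosine similarity loss: used with sentence pairs that carry a similarity score; it encourages the cosine similarity of the embeddings to match the provided score.
Matryoshka loss: a loss designed specifically to produce truncatable Matryoshka embeddings.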
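To show how such a loss turns distances into a training signal, here is a minimal sketch of a triplet loss with Euclidean distance. The margin of 1.0 and the two-dimensional vectors are assumptions for illustration; library implementations may use different defaults and operate on batches of tensors.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther away than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = [0.0, 0.0]
near = [0.0, 1.0]  # distance 1 from the anchor
far = [3.0, 4.0]   # distance 5 from the anchor

print(triplet_loss(anchor, positive=near, negative=far))  # well separated
print(triplet_loss(anchor, positive=far, negative=near))  # wrong way round
```

When the positive is already much closer than the negative, the loss is zero and the weights are left alone; when the order is reversed, the loss is large and pushes the embeddings apart.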

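Code example
Installing dependencies
We start by installing the required libraries. We use datasets, sentence-transformers, and google-generativeai for dataset handling, the embedding model, and text generation respectively.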

```shell
apt-get -qq install poppler-utils tesseract-ocr
pip install datasets sentence-transformers google-generativeai
pip install -q --user --upgrade pillow
pip install -q "unstructured[all-docs]" pi_heif
pip install -q --upgrade unstructured
pip install --upgrade nltk
```

We also install the unstructured library for PDF parsing and the nltk library for text processing.

PDF parsing and text extraction
We use the unstructured library to extract text and tables from the PDF files.

```python
import os
from collections import Counter

import nltk
from unstructured.partition.pdf import partition_pdf

# Tokenizer and tagger data required for text processing
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt_tab')


def process_pdfs_in_folder(folder_path):
    total_text = []  # Accumulates the text from all PDFs

    # Get list of all PDF files in the folder
    pdf_files = [f for f in os.listdir(folder_path) if f.endswith('.pdf')]

    for pdf_file in pdf_files:
        pdf_path = os.path.join(folder_path, pdf_file)
        print(f"Processing: {pdf_path}")

        # Apply the partition logic
        elements = partition_pdf(pdf_path, strategy="auto")

        # Show how many elements of each type were found
        # (print works both inside and outside a notebook)
        print(Counter(type(element) for element in elements))

        # Join the elements to form text and add it to total_text list
        text = "\n\n".join([str(el) for el in elements])
        total_text.append(text)

    # Return the total concatenated text
    return "\n\n".join(total_text)


folder_path = "data"
all_text = process_pdfs_in_folder(folder_path)
```

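We iterate over every PDF file in the given folder and partition its content into text, table, and figure elements.

We then join the elements into a single text representation.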

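Custom text chunking
Next, we use nltk to split the extracted text into manageable chunks. This step is needed so that each piece of text is small enough for a large language model (LLM) to process.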

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')


def nltk_based_splitter(text: str, chunk_size: int, overlap: int) -> list:
    """
    Splits the input text into chunks of a specified size, with optional overlap between chunks.

    Parameters:
    - text: The input text to be split.
    - chunk_size: The maximum size of each chunk (in terms of characters).
    - overlap: The number of overlapping characters between consecutive chunks.

    Returns:
    - A list of text chunks, with or without overlap.
    """
    # Tokenize the input text into individual sentences
    sentences = sent_tokenize(text)

    chunks = []
    current_chunk = ""

    for sentence in sentences:
        # If the current chunk plus the next sentence doesn't exceed the chunk size, add the sentence
        if len(current_chunk) + len(sentence) <= chunk_size:
            current_chunk += " " + sentence
        else:
            # Otherwise store the current chunk (skipping empty ones) and start a new chunk
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
            current_chunk = sentence

    # After the loop, add any leftover text as the final chunk
    if current_chunk.strip():
        chunks.append(current_chunk.strip())

    # Handle overlap if it's specified (overlap > 0)
    if overlap > 0:
        overlapping_chunks = []
        for i in range(len(chunks)):
            if i > 0:
                # Calculate the start index for overlap from the previous chunk
                start_overlap = max(0, len(chunks[i-1]) - overlap)
                # Combine the overlapping portion of the previous chunk with the current chunk
                chunk_with_overlap = chunks[i-1][start_overlap:] + " " + chunks[i]
                # Truncate so the chunk is not longer than chunk_size
                # (this can cut off the tail of the current chunk)
                overlapping_chunks.append(chunk_with_overlap[:chunk_size])
            else:
                # For the first chunk, there's no previous chunk to overlap with
                overlapping_chunks.append(chunks[i][:chunk_size])

        return overlapping_chunks  # Return the list of chunks with overlap

    # If overlap is 0, return the non-overlapping chunks
    return chunks


chunks = nltk_based_splitter(text=all_text,
                             chunk_size=2048,
                             overlap=0)
```

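Dataset generator
In this section we define two functions:

The prompt function builds a prompt for Google Gemini that asks for one question-answer pair based on the supplied text chunk.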

```python
import google.generativeai as genai
import pandas as pd

# Replace with your valid Google API key
GOOGLE_API_KEY = "xxxxxxxxxxxx"


# Prompt generator with an explicit request for structured output
def prompt(text_chunk):
    return f"""
    Based on the following text, generate one Question and its corresponding Answer.
    Please format the output as follows:
    Question: [Your question]
    Answer: [Your answer]

    Text: {text_chunk}
    """


# Function to interact with Google's Gemini and return a QA pair
def generate_with_gemini(text_chunk: str, temperature: float, model_name: str):
    genai.configure(api_key=GOOGLE_API_KEY)
    generation_config = {"temperature": temperature}

    # Initialize the generative model
    gen_model = genai.GenerativeModel(model_name, generation_config=generation_config)

    # Generate response based on the prompt
    response = gen_model.generate_content(prompt(text_chunk))

    # Extract question and answer from response using keyword
    try:
        question, answer = response.text.split("Answer:", 1)
        question = question.replace("Question:", "").strip()
        answer = answer.strip()
    except ValueError:
        question, answer = "N/A", "N/A"  # Handle unexpected format in response

    return question, answer
```

The generate_with_gemini function calls the Gemini model with that prompt and parses the reply into a question-answer pair.
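The parsing step inside generate_with_gemini can be checked offline, without an API key. The sketch below isolates the same split-on-"Answer:" logic in a hypothetical helper named `parse_qa`; the sample reply is hand-written, not real model output.

```python
def parse_qa(response_text):
    """Split a 'Question: ... Answer: ...' reply into (question, answer)."""
    try:
        # Raises ValueError when the "Answer:" marker is missing
        question, answer = response_text.split("Answer:", 1)
    except ValueError:
        return "N/A", "N/A"
    return question.replace("Question:", "").strip(), answer.strip()


# Hand-written reply in the requested format (not real model output)
sample_reply = (
    "Question: What does MRL stand for?\n"
    "Answer: Matryoshka Representation Learning."
)

print(parse_qa(sample_reply))
print(parse_qa("The model ignored the requested format."))
```

Replies that do not follow the format fall back to "N/A", so such rows are easy to find and drop before training.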

Running the Q&A generation
With the process_text_chunks function, we generate a question-answer pair for every text chunk using the Gemini model.

```python
def process_text_chunks(text_chunks: list, temperature: float, model_name: str):
    """
    Processes a list of text chunks to generate questions and answers using a specified model.

    Parameters:
    - text_chunks: A list of text chunks to process.
    - temperature: The sampling temperature to control randomness in the generated outputs.
    - model_name: The name of the model to use for generating questions and answers.

    Returns:
    - A Pandas DataFrame containing the text chunks, questions, and answers.
    """
    results = []

    # Iterate through each text chunk
    for chunk in text_chunks:
        question, answer = generate_with_gemini(chunk, temperature, model_name)
        results.append({"Text Chunk": chunk, "Question": question, "Answer": answer})

    # Convert results into a Pandas DataFrame
    df = pd.DataFrame(results)
    return df


# Process the text chunks and get the DataFrame
df_results = process_text_chunks(text_chunks=chunks,
                                 temperature=0.7,
                                 model_name="gemini-1.5-flash")
df_results.to_csv("generated_qa_pairs.csv", index=False)
```

The results are stored in a Pandas DataFrame and written to generated_qa_pairs.csv.

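Loading the dataset
Next, we load the generated question-answer pairs from the CSV file into a Hugging Face dataset and reshape them into the format needed for fine-tuning.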

```python
from datasets import load_dataset

# Load the CSV file into a Hugging Face Dataset
dataset = load_dataset('csv', data_files='generated_qa_pairs.csv')


def process_example(example, idx):
    return {
        "id": idx,  # Add unique ID based on the index
        "anchor": example["Question"],
        "positive": example["Answer"]
    }


dataset = dataset.map(process_example,
                      with_indices=True,
                      remove_columns=["Text Chunk", "Question", "Answer"])
```
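Because generate_with_gemini falls back to "N/A" when a reply cannot be parsed, some rows may not contain a usable pair. The sketch below is one possible way to screen them out; `is_valid_pair` is a hypothetical helper, and it also treats missing values as invalid on the assumption that the CSV reader may turn "N/A" into an empty cell.

```python
def is_valid_pair(example):
    """Return True only if both the anchor and the positive hold real text."""
    for key in ("anchor", "positive"):
        value = example.get(key)
        # Reject missing values, blank strings, and the "N/A" fallback marker
        if not isinstance(value, str) or not value.strip() or value.strip() == "N/A":
            return False
    return True


examples = [
    {"id": 0, "anchor": "What is hypertension?", "positive": "High blood pressure."},
    {"id": 1, "anchor": "N/A", "positive": "N/A"},
    {"id": 2, "anchor": None, "positive": "An answer without a question."},
]
print([ex["id"] for ex in examples if is_valid_pair(ex)])
```

Under those assumptions, the same function could be applied to the Hugging Face dataset with `dataset = dataset.filter(is_valid_pair)` after the map step.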

Loading the model
We load the BAAI/bge-base-en-v1.5 model from Hugging Face and select the appropriate device for execution (GPU if available, otherwise CPU).

```python
import torch
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import (
    InformationRetrievalEvaluator,
    SequentialEvaluator,
)
from sentence_transformers.util import cos_sim
from datasets import load_dataset, concatenate_datasets
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss


model_id = "BAAI/bge-base-en-v1.5"

# Load a model
model = SentenceTransformer(
    model_id, device="cuda" if torch.cuda.is_available() else "cpu"
)
```

Defining the loss function
Here we configure the Matryoshka loss and specify the dimensions to which the embeddings will be truncated.

```python
# Important: large to small
matryoshka_dimensions = [768, 512, 256, 128, 64]
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)
```

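The inner loss, MultipleNegativesRankingLoss, helps the model produce embeddings suited to retrieval tasks: for each anchor, it treats the other positives in the same batch as negatives.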
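The following is a simplified, pure-Python sketch of the idea behind combining the two losses, not the library implementation. It assumes an in-batch ranking loss computed as cross-entropy over scaled cosine similarities (the scale of 20.0 is an assumption) and an unweighted sum over truncated views of the embeddings; the function names and toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy where positives[i] is the match for anchors[i] and
    every other positive in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine_similarity(anchor, p) for p in positives]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        total += log_sum_exp - logits[i]
    return total / len(anchors)


def matryoshka_style_loss(anchors, positives, dims):
    """Sum the ranking loss over several truncated views of the embeddings."""
    return sum(
        in_batch_ranking_loss([a[:d] for a in anchors], [p[:d] for p in positives])
        for d in dims
    )


# Toy 4-dimensional embeddings, invented for illustration only
anchors = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
positives = [[0.9, 0.1, 0.2, 0.0], [0.1, 0.9, 0.0, 0.2]]

aligned = matryoshka_style_loss(anchors, positives, dims=[4, 2])
swapped = matryoshka_style_loss(anchors, positives[::-1], dims=[4, 2])
print(f"aligned: {aligned:.4f}, swapped: {swapped:.4f}")
```

Because the loss is evaluated on every truncated prefix, the model is pushed to keep the leading dimensions useful on their own, which is what makes the embeddings truncatable.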

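Defining the training arguments
We use SentenceTransformerTrainingArguments to define the training arguments, including the output directory, number of epochs, batch size, learning rate, and evaluation strategy.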

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

# define training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="bge-finetuned",                 # output directory and hugging face model ID
    num_train_epochs=1,                         # number of epochs
    per_device_train_batch_size=4,              # train batch size
    gradient_accumulation_steps=16,             # 4 x 16 = effective batch size of 64 per device
    per_device_eval_batch_size=16,              # evaluation batch size
    warmup_ratio=0.1,                           # warmup ratio
    learning_rate=2e-5,                         # learning rate, 2e-5 is a good value
    lr_scheduler_type="cosine",                 # use cosine learning rate scheduler
    optim="adamw_torch_fused",                  # use fused adamw optimizer
    tf32=True,                                  # use tf32 precision
    bf16=True,                                  # use bf16 precision
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
    eval_strategy="epoch",                      # evaluate after each epoch
    save_strategy="epoch",                      # save after each epoch
    logging_steps=10,                           # log every 10 steps
    save_total_limit=3,                         # save only the last 3 models
    load_best_model_at_end=True,                # load the best model when training ends
    metric_for_best_model="eval_dim_128_cosine_ndcg@10",  # Optimizing for the best ndcg@10 score for the 128 dimension
)
```

Note: if you are using a Tesla T4 and hit errors during training, try commenting out the tf32=True and bf16=True lines to disable TF32 and BF16 precision.
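The interaction between per_device_train_batch_size and gradient_accumulation_steps is simple arithmetic, sketched below. The helper names are hypothetical, the dataset size of 1,000 examples is a placeholder, and the step count is only a rough estimate because trainers differ in how they handle a final partial batch.

```python
import math


def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of examples that contribute to one optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def approx_updates_per_epoch(num_examples, per_device_batch_size,
                             gradient_accumulation_steps, num_devices=1):
    """Rough estimate of optimizer updates in one pass over the data."""
    batch = effective_batch_size(
        per_device_batch_size, gradient_accumulation_steps, num_devices
    )
    return math.ceil(num_examples / batch)


# Values from the training arguments above, on a single device
print(effective_batch_size(4, 16))

# Placeholder dataset size of 1,000 question-answer pairs
print(approx_updates_per_epoch(1000, 4, 16))
```

With a batch size of 4 and 16 accumulation steps, each optimizer update sees 64 examples on a single device, so a small generated dataset yields only a handful of updates per epoch.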

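Creating the evaluator
We create an evaluator to measure the model's performance during training. It uses InformationRetrievalEvaluator to assess retrieval performance at each of the dimensions used in the Matryoshka loss.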

```python
# Note: corpus and queries are both built from the training split,
# so these scores measure fit to the training data rather than generalization.
corpus = dict(
    zip(dataset['train']['id'],
        dataset['train']['positive'])
)  # Our corpus (cid => document)

queries = dict(
    zip(dataset['train']['id'],
        dataset['train']['anchor'])
)  # Our queries (qid => question)

# Create a mapping of relevant document (1 in our case) for each query
relevant_docs = {}  # Query ID to relevant documents (qid => set([relevant_cids]))
for q_id in queries:
    relevant_docs[q_id] = {q_id}

matryoshka_evaluators = []
# Iterate over the different dimensions
for dim in matryoshka_dimensions:
    ir_evaluator = InformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant_docs,
        name=f"dim_{dim}",
        truncate_dim=dim,  # Truncate the embeddings to a certain dimension
        score_functions={"cosine": cos_sim},
    )
    matryoshka_evaluators.append(ir_evaluator)

# Create a sequential evaluator
evaluator = SequentialEvaluator(matryoshka_evaluators)
```
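The metric tracked in the training arguments is NDCG@10. As a rough guide to what that number means, here is a minimal sketch of binary-relevance NDCG@k for a single query; the function name and the document ids are hypothetical, and the evaluator's own implementation may differ in details.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG@k for one query."""
    # Discounted gain: a hit at 0-based rank r contributes 1 / log2(r + 2)
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    # Ideal ordering places all relevant documents at the top
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# One relevant document per query, as in the evaluator above
print(ndcg_at_k([7, 3, 5], {7}))  # relevant document ranked first
print(ndcg_at_k([3, 7, 5], {7}))  # relevant document ranked second
print(ndcg_at_k([3, 5], {7}))     # relevant document not retrieved
```

With exactly one relevant document per query, the score depends only on the rank at which that document appears, so it rewards placing the right answer near the top of the list.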

Evaluating the model before fine-tuning
We evaluate the base model before fine-tuning to obtain a performance baseline.

```python
results = evaluator(model)

for dim in matryoshka_dimensions:
    key = f"dim_{dim}_cosine_ndcg@10"
    print(f"{key}: {results[key]}")
```

Defining the trainer
We create a SentenceTransformerTrainer object, specifying the model, training arguments, dataset, loss function, and evaluator.

```python
from sentence_transformers import SentenceTransformerTrainer

trainer = SentenceTransformerTrainer(
    model=model,  # our embedding model
    args=args,  # training arguments we defined above
    # Column order matters for MultipleNegativesRankingLoss: anchor first, then positive
    train_dataset=dataset["train"].select_columns(["anchor", "positive"]),
    loss=train_loss,  # Matryoshka loss
    evaluator=evaluator,  # Sequential evaluator
)
```

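Starting the fine-tuning
Calling trainer.train() starts the fine-tuning process, which updates the model weights using the provided data and loss function.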

```python
# start training
trainer.train()
# save the best model
trainer.save_model()
```

Once training finishes, we save the best-performing model to the specified output directory.

Evaluation after fine-tuning
Finally, we load the fine-tuned model and use the same evaluator to measure how much performance improved.

```python
import torch
from sentence_transformers import SentenceTransformer

fine_tuned_model = SentenceTransformer(
    args.output_dir, device="cuda" if torch.cuda.is_available() else "cpu"
)
# Evaluate the model
results = evaluator(fine_tuned_model)

# Print the main score
for dim in matryoshka_dimensions:
    key = f"dim_{dim}_cosine_ndcg@10"
    print(f"{key}: {results[key]}")
```

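By fine-tuning an embedding model for your domain, you help your NLP application understand that domain's language and concepts more deeply, which can substantially improve tasks such as question answering, document retrieval, and text generation.

The techniques discussed in this article, such as Matryoshka Representation Learning (MRL) and the use of a strong base model like bge-base-en, offer a practical path to building domain-specific embedding models. Although we focused mainly on the fine-tuning process, remember that dataset quality matters just as much: carefully curating a dataset that accurately reflects the nuances of your domain is essential for the best results.

As the field of natural language processing advances, we can expect more capable embedding models and fine-tuning strategies to appear. By staying informed and adapting your approach, you can get the most out of embedding models and build high-quality NLP applications that fit your needs.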
