Hands-on RAG: Fine-tuning an LLM-based embedding model, intfloat/e5-mistral-7b-instruct

YueTann

Last modified: 2024-09-18 10:43:30


First published: 2024-09-18 10:39:50

Article link: https://blog.csdn.net/weixin_38812492/article/details/142248177


1. Environment setup

pip install transformers
pip install open-retrievals

Note that the package is installed with pip install open-retrievals, but it is imported as retrievals.
For the latest updates, see https://github.com/LongxingTan/open-retrievals
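
Because the pip package name and the import name differ, a quick check can confirm which name Python actually resolves after installation. This is a minimal sketch using only the standard library; `is_importable` is a hypothetical helper introduced here, not part of open-retrievals.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


# `retrievals` is the import name; `open-retrievals` is only the pip package
# name (a hyphen is not valid in a Python module name anyway).
for name in ("retrievals", "open-retrievals"):
    print(name, "->", is_importable(name))
```

If `retrievals` is reported as not importable, the package is most likely installed in a different environment from the one running the script.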

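2. Using Mistral as the embedding model

Here the query_instruction and document_instruction are written directly into the texts. The document instruction is empty, so only the two queries carry a prefix.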

from retrievals import AutoModelForEmbedding

model_name = 'intfloat/e5-mistral-7b-instruct'
model = AutoModelForEmbedding.from_pretrained(
    model_name,
    pooling_method='last',
    use_fp16=True,
)

texts = [
'Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: how much protein should a female eat', 
'Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: summit define', 
"As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.", 
'Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments.'
]

embeds = model.encode(texts, normalize_embeddings=True)
print(embeds)

scores = (embeds[:2] @ embeds[2:].T) * 100
print(scores.tolist())
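
The two query strings above follow the template `Instruct: {task}\nQuery: {query}`, while the documents are passed without any prefix. A small helper keeps that template in one place. This is only a sketch: `build_query` and `TASK` are hypothetical names introduced for illustration, not functions or constants of retrievals.

```python
TASK = "Given a web search query, retrieve relevant passages that answer the query"


def build_query(task: str, query: str) -> str:
    """Prepend the instruction template to a raw query string."""
    return f"Instruct: {task}\nQuery: {query}"


queries = ["how much protein should a female eat", "summit define"]
prefixed = [build_query(TASK, q) for q in queries]
print(prefixed[1])
```

The resulting strings can be placed in front of the unmodified document texts, as in the list passed to encode above.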


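Alternatively, the prompt can be passed as arguments when loading the model, and the query or document side is then selected with is_query at encode time.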

from retrievals import AutoModelForEmbedding

model_name = 'intfloat/e5-mistral-7b-instruct'
model = AutoModelForEmbedding.from_pretrained(
    model_name,
    pooling_method='last',
    use_fp16=True,
    query_instruction='Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: ',
    document_instruction='',
)


query_texts = ['how much protein should a female eat', 'summit define']
document_texts = ["As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.", 'Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments.']

query_embeds = model.encode(query_texts, normalize_embeddings=True, is_query=True)
print(query_embeds)

doc_embeds = model.encode(document_texts, normalize_embeddings=True, is_query=False)
print(doc_embeds)

scores = (query_embeds @ doc_embeds.T) * 100
print(scores.tolist())
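
Conceptually, pooling_method='last' together with normalize_embeddings=True means: take the hidden state of the last real (non-padding) token of each sequence, scale it to unit length, and compare vectors by dot product, which then equals cosine similarity. The sketch below shows these three steps on toy lists in plain Python, assuming right-padded inputs. It is not the library's implementation, and all function names are hypothetical.

```python
import math


def last_token_pool(hidden_states, attention_mask):
    """Pick the hidden state of the last non-padding token (right padding assumed)."""
    pooled = []
    for token_vectors, mask in zip(hidden_states, attention_mask):
        last_index = sum(mask) - 1  # number of real tokens minus one
        pooled.append(token_vectors[last_index])
    return pooled


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def scaled_scores(query_vectors, doc_vectors, scale=100.0):
    """Dot products of unit vectors, i.e. cosine similarity times `scale`."""
    return [
        [scale * sum(q * d for q, d in zip(qv, dv)) for dv in doc_vectors]
        for qv in query_vectors
    ]


# Toy batch: 2 sequences, 3 token positions, hidden size 2; 0 in the mask = padding.
hidden = [
    [[0.1, 0.9], [3.0, 4.0], [0.0, 0.0]],  # 2 real tokens, 1 padding position
    [[1.0, 1.0], [2.0, 2.0], [0.0, 5.0]],  # 3 real tokens
]
mask = [[1, 1, 0], [1, 1, 1]]

toy_embeds = [l2_normalize(v) for v in last_token_pool(hidden, mask)]
toy_scores = scaled_scores(toy_embeds[:1], toy_embeds[1:])
print(toy_embeds, toy_scores)
```

The factor of 100 only rescales the cosine similarities for readability; it does not change the ranking of documents.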

3. LoRA fine-tuning of the E5-mistral embedding model

As usual in this series, the training data is t2-ranking.
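
The training command that follows reads passages from the keys named by --positive_key and --negative_key. As a rough sketch, one JSONL record might look like the following, assuming the query is stored under a `query` key and that positives and negatives are lists of strings. Both points are assumptions to verify against your own file, and the sample texts are placeholders rather than real t2-ranking data.

```python
import json

# Hypothetical record layout: only the `positive` and `negative` key names
# come from the training flags; the rest is assumed for illustration.
record = {
    "query": "placeholder query text",
    "positive": ["placeholder relevant passage"],
    "negative": ["placeholder irrelevant passage"],
}

line = json.dumps(record, ensure_ascii=False)  # one JSON object per line


def check_record(raw_line, positive_key="positive", negative_key="negative"):
    """Parse one JSONL line and verify the expected keys are present and non-empty."""
    item = json.loads(raw_line)
    for key in ("query", positive_key, negative_key):
        if not item.get(key):
            raise ValueError(f"missing or empty field: {key}")
    return item


parsed = check_record(line)
print(sorted(parsed))
```

Running a check like this over the whole file before training can surface malformed lines early, before the model is loaded.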

MODEL_NAME="intfloat/e5-mistral-7b-instruct"
TRAIN_DATA="/root/kag101/src/open-retrievals/t2/t2_ranking.jsonl"
OUTPUT_DIR="/root/kag101/src/open-retrievals/t2/ft_out"


torchrun --nproc_per_node 1 \
  -m retrievals.pipelines.embed \
  --output_dir $OUTPUT_DIR \
  --overwrite_output_dir \
  --model_name_or_path $MODEL_NAME \
  --pooling_method last \
  --do_train \
  --data_name_or_path $TRAIN_DATA \
  --positive_key positive \
  --negative_key negative \
  --use_lora True \
  --query_instruction 'Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: ' \
  --document_instruction '' \
  --learning_rate 1e-5 \
  --bf16 \
  --num_train_epochs 3 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 16 \
  --dataloader_drop_last True \
  --query_max_length 64 \
  --document_max_length 256 \
  --train_group_size 2 \
  --logging_strategy steps \
  --logging_steps 100 \
  --temperature 0.02 \
  --use_inbatch_negative false \
  --save_total_limit 1
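
Two of the flags above can be unpacked with a little arithmetic. The sketch below computes the effective batch size per optimizer step, and shows how a temperature such as 0.02 sharpens a softmax-style contrastive loss. It assumes the pipeline uses an InfoNCE-like loss and that train_group_size 2 means one positive plus one negative per query. Neither is confirmed by this article, the similarity values are made up, and the function names are hypothetical.

```python
import math


def effective_batch_size(per_device, accumulation_steps, num_processes):
    """Number of queries contributing to one optimizer step."""
    return per_device * accumulation_steps * num_processes


def info_nce_loss(positive_score, negative_scores, temperature):
    """Cross-entropy over [positive, negatives] similarities divided by temperature."""
    logits = [positive_score / temperature]
    logits += [score / temperature for score in negative_scores]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[0]


# Values from the command: batch size 2, 16 accumulation steps, 1 process.
batch = effective_batch_size(2, 16, 1)

# Assumed setup: each query is compared with 1 positive and 1 negative.
easy = info_nce_loss(0.8, [0.3], temperature=0.02)  # positive clearly ahead
hard = info_nce_loss(0.5, [0.5], temperature=0.02)  # positive and negative tied
print(batch, easy, hard)
```

With a small temperature, even a modest gap between the positive and negative similarity pushes the loss close to zero, so the gradient signal comes mostly from pairs the model still confuses.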


Because the underlying Trainer can use multiple GPUs in several ways, retrievals supports all of these launch methods as well.

# torchrun --nnodes 1 --nproc-per-node 4
# deepspeed --include localhost:0,1,2,3
# CUDA_VISIBLE_DEVICES=1,2,3 python
# accelerate launch --config_file conf_ds.yaml \

accelerate launch \
    --config_file conf_llm.yaml \
    llm_finetune_for_embed.py \
    --model_name_or_path mistralai/Mistral-7B-v0.1 \
    --train_data  \
    --output_dir output \