Datawhale AI Summer Camp: Multimodal Large Models (1)

Leslie X徐

Last modified: 2024-08-11 12:46:57

First published: 2024-08-09 22:02:19

Article link: https://blog.csdn.net/weixin_44342705/article/details/141071529

Competition description: https://tianchi.aliyun.com/competition/entrance/532251/introduction?spm=a2c22.12281925.0.0.2f307137p8qZmp
Learning platform: https://linklearner.com/home

Day 1

1. Sign up for the competition track: https://tianchi.aliyun.com/competition/entrance/532251
2. Try to run the baseline end to end and check in: https://linklearner.com/activity/14/13/27

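Detailed description:
Better Synth is a data-centric challenge that examines how to synthesize and clean image-text data so that a multimodal large model achieves better image understanding.
The competition trains on the Mini-Gemini model and covers only data synthesis and cleaning for the pretraining (cross-modal alignment) stage; the instruction-finetuning stage uses a fixed dataset. So that participants can iterate on their data synthesis schemes more efficiently, the competition uses the MGM-2B model size.
The organizers provide a candidate seed dataset. Participants must synthesize and clean data based on it to produce a higher-quality, more diverse dataset, and then train under the given compute constraints. The organizers also provide a development kit; participants must train and evaluate within a unified framework and parameter settings so that performance differences caused by the data can be compared fairly. The dataset pipeline must include a "synthesis" step; solutions without one are considered invalid.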

Environment setup

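Create a DSW instance on Alibaba Cloud; the free trial offers 5000 compute units valid for 3 months. Then, in the ModelScope community, go to My Notebook -> Personal cloud account authorized instance -> PAI-DSW and create the instance.
Image URL: dsw-registry-vpc.cn-hangzhou.cr.aliyuncs.com/pai-training-algorithm/data-juicer-better-synth:0.0.1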

If you are not using the image above, install the environment manually:

SCRIPT_DIR=$(cd "$(dirname "$0")" && pwd)

# for data-juicer
echo "[1] Installing toolkit/data-juicer"
cd ${SCRIPT_DIR}/toolkit
git clone https://github.com/modelscope/data-juicer.git
cd data-juicer
pip install ".[all]"

# for MGM training
echo "[2] Installing toolkit/training"
cd ${SCRIPT_DIR}/toolkit/training
pip install -e .
pip install flash-attn --no-build-isolation

echo "Done"

Baseline file layout

input -> pretrain_stage_1_10k: 10k image-text pairs; images holds the images, and mgm_pretrain_stage_1_10k.jsonl holds the image-to-text pairing
models: stores the BLIP2 model
toolkit:
- train_mgm_2b_stage_1.sh: training script
- eval: evaluation scripts (textvqa.sh, mmbench.sh)
- training: training code, data, and models
  - data: eval, finetune
  - model_zoo: LLM (Gemma), OpenAI (CLIP)
output:
- image_captioning_output / res_10k.jsonl
- eval_results
- training_dirs: pretrain_dir, finetune_dir

Running the baseline

Download the baseline code: git clone https://www.modelscope.cn/datasets/Datawhale/better_synth_challenge_baseline.git
Install the dependencies:

apt update && apt install axel zip file
pip install modelscope

Here axel is used to speed up the downloads.

Download the models and data

cd better_synth_challenge_baseline
bash download.sh  # takes roughly 50 minutes

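Notes:
If the download is very slow, press Ctrl+C and run the script again; each run resumes from where it stopped.
A file ending in tar.gz.st means that archive has not finished downloading and the download must continue.
After each download finishes, the archive is extracted. Do not press Ctrl+C at this point just because it looks stuck; otherwise the extraction will be incomplete and cause problems later.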
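
To check for unfinished downloads without scanning directories by hand, a small helper can list the leftover state files. This is a minimal sketch that assumes, as described in the notes, that an unfinished archive is marked by a file ending in .st next to it; the function name find_incomplete_downloads is hypothetical and not part of the baseline.

```python
from pathlib import Path


def find_incomplete_downloads(root: str) -> list[Path]:
    """Return every *.st state file found under root, sorted by path."""
    return sorted(Path(root).rglob("*.st"))


# Run from the baseline directory; an empty result suggests all downloads finished.
for state_file in find_incomplete_downloads("."):
    print(f"unfinished download: {state_file}")
```

If anything is printed, rerunning download.sh should resume the corresponding archive.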

What download.sh downloads:

SCRIPT_DIR=$(cd "$(dirname "$0")" && pwd)

# for base models: the Gemma LLM
echo "[1] Downloading base models for training..."
mkdir -p ${SCRIPT_DIR}/toolkit/training/model_zoo/LLM/gemma
cd ${SCRIPT_DIR}/toolkit/training/model_zoo/LLM/gemma
axel -n 5 http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/dj-competition/better_synth/models/gemma-2b-it.tar.gz
tar zxvf gemma-2b-it.tar.gz

# OpenAI CLIP vision encoders
mkdir -p ${SCRIPT_DIR}/toolkit/training/model_zoo/OpenAI
cd ${SCRIPT_DIR}/toolkit/training/model_zoo/OpenAI
axel -n 5 http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/dj-competition/better_synth/models/clip-vit-large-patch14-336.tar.gz
tar zxvf clip-vit-large-patch14-336.tar.gz
axel -n 5 http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/dj-competition/better_synth/models/openclip-convnext-large-d-320-laion2B-s29B-b131K-ft-soup.tar.gz
tar zxvf openclip-convnext-large-d-320-laion2B-s29B-b131K-ft-soup.tar.gz

# for training data
echo "[2] Downloading seed datasets..."
mkdir -p ${SCRIPT_DIR}/input
cd ${SCRIPT_DIR}/input
axel -n 5 http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/dj-competition/better_synth/data/stage_1/pretrain_stage_1_10k.tar.gz
tar zxvf pretrain_stage_1_10k.tar.gz
cd pretrain_stage_1_10k
axel -n 5 http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/dj-competition/better_synth/data/stage_1/mgm_pretrain_stage_1_10k.jsonl
axel -n 5 http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/dj-competition/better_synth/data/stage_1/stage_1.json

echo "[3] Downloading finetuning datasets..."
mkdir -p ${SCRIPT_DIR}/toolkit/training/data
cd ${SCRIPT_DIR}/toolkit/training/data
axel -n 5 http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/dj-competition/better_synth/data/stage_1/finetuning_stage_1_12k.tar.gz
tar zxvf finetuning_stage_1_12k.tar.gz
cd finetuning_stage_1_12k
axel -n 5 http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/dj-competition/better_synth/data/stage_1/mgm_instruction_stage_1_12k.json

# for eval data
echo "[4] Downloading evaluation datasets"
mkdir -p ${SCRIPT_DIR}/toolkit/training/data
cd ${SCRIPT_DIR}/toolkit/training/data
axel -n 5 http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/dj-competition/better_synth/data/stage_1/eval_stage_1.tar.gz
tar zxvf eval_stage_1.tar.gz

echo "Done"

Data format of mgm_pretrain_stage_1_10k.jsonl

{"id": "006092514", "text": "<__dj__image>\nthe end part of the minecraft exploits with tmg bond <|__dj__eoc|>", "images": ["images/00609/006092514.jpg"]}
{"id": "003743290", "text": "<__dj__image>\nfloral banner or background with white border - kostenk, vector grat stock fotografie <|__dj__eoc|>", "images": ["images/00374/003743290.jpg"]}
{"id": "001419269", "text": "<__dj__image>\na l c womens wool - blend coat in black <|__dj__eoc|>", "images": ["images/00141/001419269.jpg"]}
{"id": "005736255", "text": "<__dj__image>\nqueen of the coupons sticker <|__dj__eoc|>", "images": ["images/00573/005736255.jpg"]}
{"id": "005738088", "text": "<__dj__image>\ncolores coreline eye and cheek tin <|__dj__eoc|>", "images": ["images/00573/005738088.jpg"]}
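
Each line is a standalone JSON object whose text field wraps the caption between the special tokens <__dj__image> and <|__dj__eoc|>. The sketch below shows one way to read such a line and recover the plain caption; the helper name parse_sample is hypothetical, and taking only the first entry of images is an assumption based on the samples shown here.

```python
import json

IMAGE_TOKEN = "<__dj__image>"
EOC_TOKEN = "<|__dj__eoc|>"


def parse_sample(line: str) -> dict:
    """Parse one JSONL line and strip the special tokens from its text."""
    record = json.loads(line)
    caption = record["text"].replace(IMAGE_TOKEN, "").replace(EOC_TOKEN, "").strip()
    # Assumption: one image per sample, as in the samples above.
    return {"id": record["id"], "caption": caption, "image": record["images"][0]}


# One sample line copied from the file; the raw string keeps the JSON \n escape intact.
sample_line = r'{"id": "005736255", "text": "<__dj__image>\nqueen of the coupons sticker <|__dj__eoc|>", "images": ["images/00573/005736255.jpg"]}'
parsed = parse_sample(sample_line)
print(parsed["id"], "|", parsed["caption"], "|", parsed["image"])
```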

Download the BLIP2 model, which is used to generate captions for the images

### Download the BLIP2 model; takes about 20 minutes
from modelscope import snapshot_download

model_dir = snapshot_download('goldsj/blip2-opt-2.7b', 
                              cache_dir='/mnt/workspace/better_synth_challenge_baseline/models', 
                              revision='master')

Process the data with data-juicer, using BLIP2 to generate a caption for each image:
dj-process --config solution/image_captioning.yaml

The config file used above:

dataset_path: /mnt/workspace/better_synth_challenge_baseline/input/pretrain_stage_1_10k/mgm_pretrain_stage_1_10k.jsonl
export_path: /mnt/workspace/better_synth_challenge_baseline/output/image_captioning_output/res_10k.jsonl

np: 1
process:
  - image_captioning_mapper:
      hf_img2seq: '/mnt/workspace/better_synth_challenge_baseline/models/goldsj/blip2-opt-2___7b'  # You can replace this path to a local downloaded HF model
      keep_original_sample: false  # we only need the recaptioned captions

Output

{"id":"006092514","text":"<__dj__image> an image of the minecraft logo showing a clock in an open area\n <|__dj__eoc|>","images":["\/mnt\/workspace\/better_synth_challenge_baseline\/input\/pretrain_stage_1_10k\/images\/00609\/006092514.jpg"]}
{"id":"003743290","text":"<__dj__image> a colorful set of horizontal banners\n <|__dj__eoc|>","images":["\/mnt\/workspace\/better_synth_challenge_baseline\/input\/pretrain_stage_1_10k\/images\/00374\/003743290.jpg"]}
{"id":"001419269","text":"<__dj__image> g-star wash - black wool overcoat in full length\n <|__dj__eoc|>","images":["\/mnt\/workspace\/better_synth_challenge_baseline\/input\/pretrain_stage_1_10k\/images\/00141\/001419269.jpg"]}
{"id":"005736255","text":"<__dj__image> queen of the coupst stickers\n <|__dj__eoc|>","images":["\/mnt\/workspace\/better_synth_challenge_baseline\/input\/pretrain_stage_1_10k\/images\/00573\/005736255.jpg"]}
{"id":"005738088","text":"<__dj__image> caroline powder blush in red with white labels and logos\n <|__dj__eoc|>","images":["\/mnt\/workspace\/better_synth_challenge_baseline\/input\/pretrain_stage_1_10k\/images\/00573\/005738088.jpg"]}
{"id":"000270672","text":"<__dj__image> three donuts made with dark chocolate icing on top of a cooling rack\n <|__dj__eoc|>","images":["\/mnt\/workspace\/better_synth_challenge_baseline\/input\/pretrain_stage_1_10k\/images\/00027\/000270672.jpg"]}
{"id":"000704893","text":"<__dj__image> the golden goose super star sneakers with glitter\n <|__dj__eoc|>","images":["\/mnt\/workspace\/better_synth_challenge_baseline\/input\/pretrain_stage_1_10k\/images\/00070\/000704893.jpg"]}
{"id":"000888725","text":"<__dj__image> a star wars land in disney world\n <|__dj__eoc|>","images":["\/mnt\/workspace\/better_synth_challenge_baseline\/input\/pretrain_stage_1_10k\/images\/00088\/000888725.jpg"]}
{"id":"002873494","text":"<__dj__image> gpu geforce gtx 1070 4gb oc\n <|__dj__eoc|>","images":["\/mnt\/workspace\/better_synth_challenge_baseline\/input\/pretrain_stage_1_10k\/images\/00287\/002873494.jpg"]}
{"id":"000933565","text":"<__dj__image> playmobil® playset with toy soldier and green soldiers\n <|__dj__eoc|>","images":["\/mnt\/workspace\/better_synth_challenge_baseline\/input\/pretrain_stage_1_10k\/images\/00093\/000933565.jpg"]}
{"id":"000398115","text":"<__dj__image> a beer logo with palm trees and the word goodland written to the right\n <|__dj__eoc|>","images":["\/mnt\/workspace\/better_synth_challenge_baseline\/input\/pretrain_stage_1_10k\/images\/00039\/000398115.jpg"]}
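
To judge what recaptioning changed, the original and generated captions can be lined up by id. The following is a rough sketch using one record from each file; the function names caption_of and pair_captions are hypothetical, and in practice the two lists would be read from mgm_pretrain_stage_1_10k.jsonl and res_10k.jsonl.

```python
import json

SPECIAL_TOKENS = ("<__dj__image>", "<|__dj__eoc|>")


def caption_of(record: dict) -> str:
    """Return the text field with the special tokens and outer whitespace removed."""
    text = record["text"]
    for token in SPECIAL_TOKENS:
        text = text.replace(token, "")
    return text.strip()


def pair_captions(original_lines, recaptioned_lines) -> dict:
    """Map id -> (original caption, recaptioned caption) for ids present in both."""
    original = {r["id"]: caption_of(r) for r in map(json.loads, original_lines)}
    pairs = {}
    for record in map(json.loads, recaptioned_lines):
        if record["id"] in original:
            pairs[record["id"]] = (original[record["id"]], caption_of(record))
    return pairs


original_lines = [
    r'{"id": "005738088", "text": "<__dj__image>\ncolores coreline eye and cheek tin <|__dj__eoc|>", "images": ["images/00573/005738088.jpg"]}',
]
recaptioned_lines = [
    r'{"id":"005738088","text":"<__dj__image> caroline powder blush in red with white labels and logos\n <|__dj__eoc|>","images":["\/mnt\/workspace\/better_synth_challenge_baseline\/input\/pretrain_stage_1_10k\/images\/00573\/005738088.jpg"]}',
]
pairs = pair_captions(original_lines, recaptioned_lines)
for sample_id, (before, after) in pairs.items():
    print(sample_id, "|", before, "->", after)
```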

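After the data is processed, run model training and inference.
First change the instance configuration: on the Alibaba Cloud platform go to PAI -> Interactive Modeling (DSW) -> Change instance configuration, and choose a GPU spec with at least 32 GB of GPU memory (note that the stage-two finetune needs at least 32 GB).

Then run: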

cd toolkit
git clone https://github.com/modelscope/data-juicer.git
bash train_mgm_2b_stage_1.sh   # takes roughly 3 hours

train_mgm_2b_stage_1.sh: this script runs pretraining, then the stage-two finetune, and finally evaluation

#!/bin/bash
############################################################################
########################### Editable Part Begins ###########################
############################################################################

# exp meta information
EXP_NAME=default
PRETRAIN_DATASET=../output/image_captioning_output/res_10k.jsonl
PRETRAIN_DATASET_IMAGE_PATH=../input/pretrain_stage_1_10k

# training args
# pretraining
# make sure PRETRAIN_BATCH_SIZE_PER_GPU * PRETRAIN_GRADIENT_ACCUMULATION_STEPS * num_gpus = 256
# **NOTICE**: the default setting is for 1 GPU
PRETRAIN_BATCH_SIZE_PER_GPU=4
PRETRAIN_GRADIENT_ACCUMULATION_STEPS=64
PRETRAIN_DATALOADER_NUM_WORKERS=4
# finetuning
# make sure FINETUNE_BATCH_SIZE_PER_GPU * FINETUNE_GRADIENT_ACCUMULATION_STEPS * num_gpus = 128
# **NOTICE**: the default setting is for 1 GPU
FINETUNE_BATCH_SIZE_PER_GPU=4
FINETUNE_GRADIENT_ACCUMULATION_STEPS=32
FINETUNE_DATALOADER_NUM_WORKERS=4
# log and ckpt
LOGGING_STEP=1
CKPT_SAVE_STEPS=100
TOTAL_SAVE_CKPT_LIMIT=1

# inference args
# inference for some benchmarks supports multi-gpus
INFER_CUDA_IDX="0"
############################################################################
############################ Editable Part Ends ############################
############################################################################
SCRIPT_DIR=$(cd "$(dirname "$0")" && pwd)

ORIGINAL_DATASET_ALL=$SCRIPT_DIR/../input/pretrain_stage_1_10k/stage_1.json

# check the global size
PRETRAIN_PASS=`python $SCRIPT_DIR/training/preprocess/check_global_batch_size.py $PRETRAIN_BATCH_SIZE_PER_GPU $PRETRAIN_GRADIENT_ACCUMULATION_STEPS 256`
if [ "$PRETRAIN_PASS" = "False" ]; then
    echo "[ERROR] The global batch size of pretraining stage is not 256! Please check and retry."
    exit
fi
FINETUNE_PASS=`python $SCRIPT_DIR/training/preprocess/check_global_batch_size.py $FINETUNE_BATCH_SIZE_PER_GPU $FINETUNE_GRADIENT_ACCUMULATION_STEPS 128`
if [ "$FINETUNE_PASS" = "False" ]; then
    echo "[ERROR] The global batch size of finetuning stage is not 128! Please check and retry."
    exit
fi

# check number of dataset samples
MAX_SAMPLE_NUM=200000
SAMPLED_PRETRAIN_DATASET=$PRETRAIN_DATASET-200k.jsonl
python $SCRIPT_DIR/training/preprocess/check_sample_number.py $PRETRAIN_DATASET $SAMPLED_PRETRAIN_DATASET $MAX_SAMPLE_NUM

# convert dataset from dj format to llava format
PRETRAIN_DATASET_JSON=$SAMPLED_PRETRAIN_DATASET.json
python $SCRIPT_DIR/data-juicer/tools/multimodal/data_juicer_format_to_target_format/dj_to_llava.py $SAMPLED_PRETRAIN_DATASET $PRETRAIN_DATASET_JSON --image_special_token "<__dj__image>" --restore_questions True --original_llava_ds_path $ORIGINAL_DATASET_ALL

# train model
PRETRAIN_NAME=MGM-2B-Pretrain-$EXP_NAME
FINETUNE_NAME=MGM-2B-Finetune-$EXP_NAME
AUX_SIZE=768

NUM_TRAIN_EPOCHS=1
PRETRAIN_SAMPLE_NUM=200000

mkdir -p $SCRIPT_DIR/../output/training_dirs/$PRETRAIN_NAME

# ------------- Pretrain ---------------
deepspeed $SCRIPT_DIR/training/mgm/train/train_mem.py \
    --deepspeed $SCRIPT_DIR/training/scripts/zero2_offload.json \
    --model_name_or_path $SCRIPT_DIR/training/model_zoo/LLM/gemma/gemma-2b-it \
    --version gemma \
    --data_path $PRETRAIN_DATASET_JSON \
    --image_folder $PRETRAIN_DATASET_IMAGE_PATH \
    --vision_tower $SCRIPT_DIR/training/model_zoo/OpenAI/clip-vit-large-patch14-336 \
    --vision_tower_aux $SCRIPT_DIR/training/model_zoo/OpenAI/openclip-convnext-large-d-320-laion2B-s29B-b131K-ft-soup \
    --mm_projector_type mlp2x_gelu \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_size_aux $AUX_SIZE \
    --bf16 True \
    --output_dir $SCRIPT_DIR/../output/training_dirs/$PRETRAIN_NAME \
    --num_train_epochs $NUM_TRAIN_EPOCHS \
    --per_device_train_batch_size $PRETRAIN_BATCH_SIZE_PER_GPU \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps $PRETRAIN_GRADIENT_ACCUMULATION_STEPS \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps $CKPT_SAVE_STEPS \
    --save_total_limit $TOTAL_SAVE_CKPT_LIMIT \
    --learning_rate 1e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps $LOGGING_STEP \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers $PRETRAIN_DATALOADER_NUM_WORKERS \
    --lazy_preprocess True \
    --report_to none \
    2>&1 | tee $SCRIPT_DIR/../output/training_dirs/$PRETRAIN_NAME/pretrain.log

mkdir -p $SCRIPT_DIR/../output/training_dirs/$FINETUNE_NAME

#  ------------- Finetune ---------------
deepspeed $SCRIPT_DIR/training/mgm/train/train_mem.py \
    --deepspeed $SCRIPT_DIR/training/scripts/zero2_offload.json \
    --model_name_or_path $SCRIPT_DIR/training/model_zoo/LLM/gemma/gemma-2b-it \
    --version gemma \
    --data_path $SCRIPT_DIR/training/data/finetuning_stage_1_12k/mgm_instruction_stage_1_12k.json \
    --image_folder $SCRIPT_DIR/training/data/finetuning_stage_1_12k \
    --vision_tower $SCRIPT_DIR/training/model_zoo/OpenAI/clip-vit-large-patch14-336 \
    --vision_tower_aux $SCRIPT_DIR/training/model_zoo/OpenAI/openclip-convnext-large-d-320-laion2B-s29B-b131K-ft-soup \
    --pretrain_mm_mlp_adapter $SCRIPT_DIR/../output/training_dirs/$PRETRAIN_NAME/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --image_size_aux $AUX_SIZE \
    --bf16 True \
    --output_dir $SCRIPT_DIR/../output/training_dirs/$FINETUNE_NAME \
    --num_train_epochs $NUM_TRAIN_EPOCHS \
    --per_device_train_batch_size $FINETUNE_BATCH_SIZE_PER_GPU \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps $FINETUNE_GRADIENT_ACCUMULATION_STEPS \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps $CKPT_SAVE_STEPS \
    --save_total_limit $TOTAL_SAVE_CKPT_LIMIT \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps $LOGGING_STEP \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers $FINETUNE_DATALOADER_NUM_WORKERS \
    --lazy_preprocess True \
    --report_to none \
    2>&1 | tee $SCRIPT_DIR/../output/training_dirs/$FINETUNE_NAME/finetuning.log

# inference for submission
# TextVQA
echo "Infer on TextVQA..."
bash $SCRIPT_DIR/eval/textvqa.sh $FINETUNE_NAME $INFER_CUDA_IDX
# MMBench
echo "Infer on MMBench..."
bash $SCRIPT_DIR/eval/mmbench.sh $FINETUNE_NAME "mmbench_dev_20230712" $INFER_CUDA_IDX

# copy this script to output
cp $0 $SCRIPT_DIR/../output/train.sh

# info
echo "Training and Inference done."
echo "Training checkpoints are stored in output/training_dirs/$FINETUNE_NAME."
echo "Inference results are stored in output/eval_results/$FINETUNE_NAME."
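
The editable part of the script requires that per-GPU batch size times gradient accumulation steps times the number of GPUs equals 256 for pretraining and 128 for finetuning. The sketch below illustrates that arithmetic only; it is not the baseline's check_global_batch_size.py, and the function names are hypothetical.

```python
def global_batch_size(per_gpu: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Effective batch size seen by one optimizer step."""
    return per_gpu * grad_accum * num_gpus


def grad_accum_for(target: int, per_gpu: int, num_gpus: int = 1) -> int:
    """Gradient accumulation steps needed to reach the target global batch size."""
    per_step = per_gpu * num_gpus
    if target % per_step != 0:
        raise ValueError(f"{target} is not divisible by {per_step}")
    return target // per_step


# Default single-GPU settings from the script.
print(global_batch_size(4, 64))   # pretraining
print(global_batch_size(4, 32))   # finetuning
# With 2 GPUs and the same per-GPU batch size, accumulation must be halved.
print(grad_accum_for(256, per_gpu=4, num_gpus=2))
```

When moving to more GPUs, only the accumulation steps need to change to keep the global batch size fixed.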

Output

Loading extension module cpu_adam...
Time to load cpu_adam op: 27.677916765213013 seconds
  0%|                                                                                        | 0/39 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
{'loss': 13.1191, 'grad_norm': 215.75602308863463, 'learning_rate': 0.0005, 'epoch': 0.03}                          
{'loss': 12.9561, 'grad_norm': 245.24804705387612, 'learning_rate': 0.001, 'epoch': 0.05}                           
{'loss': 1373.125, 'grad_norm': 1152.2338424281413, 'learning_rate': 0.0009981987442712632, 'epoch': 0.08}          
{'loss': 9.3379, 'grad_norm': 22.12235184085656, 'learning_rate': 0.0009928079551738544, 'epoch': 0.1}              
{'loss': 9.4634, 'grad_norm': 22.63307231665722, 'learning_rate': 0.0009838664734667494, 'epoch': 0.13}             
{'loss': 7.2959, 'grad_norm': 11.233604260137353, 'learning_rate': 0.0009714387227305421, 'epoch': 0.15}            
{'loss': 7.2471, 'grad_norm': 9.763044450077903, 'learning_rate': 0.0009556142451940679, 'epoch': 0.18}             
{'loss': 7.2354, 'grad_norm': 12.32727083600074, 'learning_rate': 0.0009365070565805941, 'epoch': 0.2}              
{'loss': 7.1592, 'grad_norm': 5.8093453916971995, 'learning_rate': 0.0009142548246219211, 'epoch': 0.23}            
{'loss': 6.9897, 'grad_norm': 4.417171339272648, 'learning_rate': 0.0008890178771592198, 'epoch': 0.26}             
{'loss': 6.6577, 'grad_norm': 3.8698857264930413, 'learning_rate': 0.0008609780469772622, 'epoch': 0.28}

The loss eventually settles at around 1.5.
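
The learning_rate values in the log are consistent with linear warmup followed by cosine decay, given --learning_rate 1e-3, --warmup_ratio 0.03, and the 39 optimizer steps shown by the progress bar (10k samples at a global batch size of 256). The sketch below re-implements that formula in plain Python as an illustration; it assumes the warmup step count is rounded up, and the function name cosine_lr is hypothetical.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float, warmup_ratio: float) -> float:
    """Learning rate after `step` optimizer steps: linear warmup, then cosine decay."""
    # Assumption: warmup steps are rounded up, so 39 * 0.03 = 1.17 becomes 2.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Compare the first few values with the learning_rate column of the log.
for step in range(1, 5):
    print(step, cosine_lr(step, total_steps=39, base_lr=1e-3, warmup_ratio=0.03))
```

The loss spike at the third step (1373.125) coincides with the first step after the rate reaches its peak of 0.001, after which the loss recovers.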

Datasets used for finetuning:
ai2d, ALLaVA-4V, chartqa, coco, docvqa, dvqa, gpt4v-dataset, gqa, llava, ocr_vqa, sam, share-textvqa, vg, web-celebrity, web-landmark, wikiart
Each of the above is an image folder; the question-answer text for the images is all in mgm_instruction_stage_1_12k.json.
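Finally, run the following in the terminal to build the submission. The log paths assume EXP_NAME was set to image_recaption in train_mgm_2b_stage_1.sh; with the default EXP_NAME=default, the directories are named MGM-2B-Finetune-default and MGM-2B-Pretrain-default instead.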

cd submit

cp -r /mnt/workspace/better_synth_challenge_baseline/solution .

cp -r /mnt/workspace/better_synth_challenge_baseline/output/eval_results output/eval_results/

cp -r /mnt/workspace/better_synth_challenge_baseline/output/train.sh output/

cp /mnt/workspace/better_synth_challenge_baseline/output/training_dirs/MGM-2B-Finetune-image_recaption/finetuning.log output/training_dirs/MGM-2B-Finetune-image_recaption/

cp /mnt/workspace/better_synth_challenge_baseline/output/training_dirs/MGM-2B-Pretrain-image_recaption/pretrain.log output/training_dirs/MGM-2B-Pretrain-image_recaption/

zip -r submit.zip solution output
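
Before zipping, it can help to confirm that every copied item actually landed in the submit directory. This is a small sketch whose list of paths is derived only from the cp commands above, not from the official submission specification; both function names are hypothetical.

```python
from pathlib import Path


def expected_submission_paths(exp_name: str) -> list[Path]:
    """Relative paths that the cp commands above are meant to create."""
    return [
        Path("solution"),
        Path("output/eval_results"),
        Path("output/train.sh"),
        Path(f"output/training_dirs/MGM-2B-Finetune-{exp_name}/finetuning.log"),
        Path(f"output/training_dirs/MGM-2B-Pretrain-{exp_name}/pretrain.log"),
    ]


def missing_submission_paths(submit_dir: str, exp_name: str) -> list[Path]:
    """Return the expected paths that do not exist under submit_dir."""
    root = Path(submit_dir)
    return [p for p in expected_submission_paths(exp_name) if not (root / p).exists()]


# Run inside the submit directory; an empty list means nothing is missing.
print(missing_submission_paths(".", "image_recaption"))
```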