Multimodal Large Models 2

Latest recommended article published on 2024-08-26 16:32:13

m0_58854572


Article link: https://blog.csdn.net/m0_58854572/article/details/141396184


The fifth Datawhale (linklearner.com) summer camp is about to start. It is free from start to finish, and I recommend it to anyone who wants to learn~

Data-Juicer: data synthesis

1. Try Data-Juicer in your browser in JupyterLab.

JupyterHub: http://8.138.149.181/hub/login?next=%2Fhub%2F%3Fspm%3Da2c22.12281978.0.0.376b10bel5pWsN

2. Run Data-Juicer by example:

# Run with: dj-process --config solution/**.yaml

dataset_path: input/data-of-image-text-pair.jsonl
export_path: output/image_captioning_output.jsonl

process:
  - image_captioning_mapper:
      hf_img2seq: '/path/to/a/local/downloaded/HF/model'
      keep_original_sample: false  # we only need the recaptioned captions

Full list of operators and their parameters: data-juicer/configs/config_all.yaml at main (modelscope/data-juicer on GitHub): https://github.com/modelscope/data-juicer/blob/main/configs/config_all.yaml
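Before choosing operators it helps to know which fields the input JSONL actually carries. The sketch below is my own helper, not part of Data-Juicer: it reads the first records of a JSONL file and counts the top-level keys, without assuming any particular schema. The function name `summarize_jsonl` and the `max_records` cap are illustrative choices.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path, max_records=1000):
    """Count how often each top-level key appears in the first records."""
    key_counts = Counter()
    n_records = 0
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            key_counts.update(record.keys())
            n_records += 1
            if n_records >= max_records:
                break
    return n_records, dict(key_counts)


if __name__ == "__main__":
    # Path taken from the config above; adjust it to your own layout.
    dataset = Path("input/data-of-image-text-pair.jsonl")
    if dataset.exists():
        print(summarize_jsonl(dataset))
```

Running the same helper on the exported file is a cheap way to confirm that a processing run wrote as many records as expected.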

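3. Operators that may be useful. (The competition requires all data to go through synthesis. Processing the full 400k samples with image_captioning_mapper first takes about 16 h on the instance I already created (dual A800-80G); at 12 yuan/hour that is 16 × 12 = 192 yuan, so 190+.)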

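The offline round only allows 10k samples. Someone wanted to save money on training and shut the machine down at 6 a.m., so a run that was 97% done was thrown away. This is not the first time it has happened! Topping up another 10 yuan would probably have been enough to finish!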

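The 400k samples can be used to test whether processing different amounts of data with the same operator affects the results differently. Before running that large experiment, try 20k first!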

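Even if processing large-scale data gets you through the preliminary round, the final is limited to 10k samples, so failing early has its benefits!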
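To make the "try 20k first" argument concrete, here is a rough estimate of time and cost at different data sizes. It assumes runtime scales linearly with the number of samples (it ignores model loading and other fixed overhead), and uses the figures from these notes as the reference point: 400k samples in about 16 h at 12 yuan/hour. The function name `estimate_run` is my own.

```python
def estimate_run(n_samples, ref_samples=400_000, ref_hours=16.0, yuan_per_hour=12.0):
    """Linearly scale the reference run to n_samples; returns (hours, yuan)."""
    hours = ref_hours * n_samples / ref_samples
    return hours, hours * yuan_per_hour


if __name__ == "__main__":
    for n in (10_000, 20_000, 400_000):
        hours, cost = estimate_run(n)
        print(f"{n:>7} samples: ~{hours:.1f} h, ~{cost:.1f} yuan")
```

Under that assumption a 20k pilot costs under 10 yuan, about 5% of the full run, which is why it is worth doing before committing to all 400k samples.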

  - random_selector:                                        # selector to randomly select samples
      select_ratio:                                           # the ratio to be sampled
      select_num:                                             # the number to be sampled

  # Selector ops
  - frequency_specified_field_selector:                     # selector to select samples based on the sorted frequency of specified field value
      field_key: ''                                           # the target keys corresponding to multi-level field information need to be separated by '.'
      top_ratio:                                              # ratio of selected top specified field value
      topk:                                                   # number of selected top specified field value
      reverse: True                                           # determine the sorting rule, if reverse=True, then sort in descending order

  - range_specified_field_selector:                         # selector to select a range of samples based on the sorted specified field value from smallest to largest.
      field_key: ''                                           # the target keys corresponding to multi-level field information need to be separated by '.'
      lower_percentile:                                       # the lower bound of the percentile to be sampled
      upper_percentile:                                       # the upper bound of the percentile to be sampled
      lower_rank:                                             # the lower rank of the percentile to be sampled
      upper_rank:                                             # the upper rank of the percentile to be sampled

  - topk_specified_field_selector:                          # selector to select top samples based on the sorted specified field
      field_key: ''                                           # the target keys corresponding to multi-level field information need to be separated by '.'
      top_ratio:                                              # ratio of selected top samples
      topk:                                                   # number of selected top samples
      reverse: True                                           # determine the sorting rule, if reverse=True, then sort in descending order
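To get a feel for what these selector parameters mean, here is a small pure-Python sketch of the same ideas. It is not Data-Juicer's implementation: the function names are mine, percentiles are taken as fractions in [0, 1], and when both `select_ratio` and `select_num` are given I take the smaller resulting count, which is an assumption about the intended behavior.

```python
import random


def get_nested(sample, field_key):
    """Resolve a '.'-separated key such as 'meta.score' in a nested dict."""
    value = sample
    for part in field_key.split("."):
        value = value[part]
    return value


def random_select(samples, select_ratio=None, select_num=None, seed=0):
    """Randomly keep samples; if both limits are given, use the smaller one."""
    if select_ratio is None and select_num is None:
        return list(samples)
    limits = []
    if select_ratio is not None:
        limits.append(int(select_ratio * len(samples)))
    if select_num is not None:
        limits.append(select_num)
    k = min(min(limits), len(samples))
    return random.Random(seed).sample(list(samples), k)


def topk_select(samples, field_key, topk, reverse=True):
    """Sort by the field (descending when reverse=True) and keep the first topk."""
    ranked = sorted(samples, key=lambda s: get_nested(s, field_key), reverse=reverse)
    return ranked[:topk]


def range_select(samples, field_key, lower_percentile, upper_percentile):
    """Sort ascending and keep the slice between the two percentiles."""
    ranked = sorted(samples, key=lambda s: get_nested(s, field_key))
    lo = int(lower_percentile * len(ranked))
    hi = int(upper_percentile * len(ranked))
    return ranked[lo:hi]


if __name__ == "__main__":
    # Toy samples with a hypothetical nested field 'meta.score'.
    data = [{"id": i, "meta": {"score": i / 10}} for i in range(10)]
    print([s["id"] for s in random_select(data, select_ratio=0.5, select_num=3)])
    print([s["id"] for s in topk_select(data, "meta.score", topk=3)])
    print([s["id"] for s in range_select(data, "meta.score", 0.2, 0.8)])
```

For the 10k limit in the final, a fixed-seed random selection like this gives a reproducible baseline to compare any score-based selection against.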

image-text-similarity-filter

  - image_text_similarity_filter:                           # filter samples according to the similarity between image and text.
      hf_clip: openai/clip-vit-base-patch32                   # name of used Hugging Face clip
      min_score: 0.1                                          # the min similarity of filter range
      max_score: 1.0                                          # the max similarity of filter range
      horizontal_flip: false                                  # flip image horizontally (left to right).
      vertical_flip: false                                    # flip image vertically (top to bottom).
      reduce_mode: avg                                        # reduce mode when one text corresponds to multiple images in a chunk, must be one of ['avg', 'max', 'min'].
      any_or_all: any                                         # keep this sample when any/all images meet the filter condition
      mem_required: '1500MB'                                  # This operation (Op) utilizes deep neural network models that consume a significant amount of memory for computation, hence the system's available memory might constrain the maximum number of processes that can be launched
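The decision logic behind `min_score`, `max_score`, `reduce_mode`, and `any_or_all` can be sketched without loading CLIP. In the sketch below, plain vectors stand in for the image and text embeddings, and the way scores are reduced per chunk and then combined across chunks is my reading of the config comments above, not a copy of Data-Juicer's code. The names `cosine_similarity`, `REDUCERS`, and `keep_sample` are my own.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# How to collapse several image scores that belong to one text chunk.
REDUCERS = {"avg": lambda xs: sum(xs) / len(xs), "max": max, "min": min}


def keep_sample(chunk_scores, min_score=0.1, max_score=1.0,
                reduce_mode="avg", any_or_all="any"):
    """chunk_scores holds one list of image-text similarities per text chunk."""
    reduce = REDUCERS[reduce_mode]
    passed = [min_score <= reduce(scores) <= max_score for scores in chunk_scores]
    return any(passed) if any_or_all == "any" else all(passed)


if __name__ == "__main__":
    # One chunk with two images: the average passes, the minimum does not.
    print(keep_sample([[0.05, 0.25]], reduce_mode="avg"))
    print(keep_sample([[0.05, 0.25]], reduce_mode="min"))
```

With the defaults shown in the config, a sample is dropped only when every chunk's reduced score falls below 0.1, so raising `min_score` is the main lever for making the filter stricter.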

image-text-matching-filter

phrase-grounding-recall-filter

Model training

########################### Editable Part Begins ###########################
export CUDA_VISIBLE_DEVICES=0,1
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1

# exp meta information
EXP_NAME=default
PRETRAIN_DATASET=../output/image_captioning_output/res_10k.jsonl
PRETRAIN_DATASET_IMAGE_PATH=../input/pretrain_stage_1_10k

# training args
# pretraining
# make sure PRETRAIN_BATCH_SIZE_PER_GPU * PRETRAIN_GRADIENT_ACCUMULATION_STEPS * num_gpus = 256
PRETRAIN_BATCH_SIZE_PER_GPU=4
PRETRAIN_GRADIENT_ACCUMULATION_STEPS=32
PRETRAIN_DATALOADER_NUM_WORKERS=4

# finetuning
# make sure FINETUNE_BATCH_SIZE_PER_GPU * FINETUNE_GRADIENT_ACCUMULATION_STEPS * num_gpus = 128
FINETUNE_BATCH_SIZE_PER_GPU=4
FINETUNE_GRADIENT_ACCUMULATION_STEPS=16
FINETUNE_DATALOADER_NUM_WORKERS=4

# log and ckpt
LOGGING_STEP=1
CKPT_SAVE_STEPS=100
TOTAL_SAVE_CKPT_LIMIT=1

# inference args
# inference for some benchmarks supports multi-gpus
INFER_CUDA_IDX="0"
############################ Editable Part Ends ############################
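The two "make sure" comments are easy to get wrong when the number of GPUs changes. The helper below, a sketch of my own and not part of the training script, derives the gradient accumulation steps from the target global batch size, the per-GPU batch size, and the `CUDA_VISIBLE_DEVICES` string.

```python
def grad_accum_steps(global_batch, per_gpu_batch, cuda_visible_devices):
    """Return the accumulation steps so that
    per_gpu_batch * steps * num_gpus == global_batch."""
    num_gpus = len([d for d in cuda_visible_devices.split(",") if d.strip()])
    if num_gpus == 0:
        raise ValueError("no GPU listed in CUDA_VISIBLE_DEVICES")
    per_step = per_gpu_batch * num_gpus
    if global_batch % per_step != 0:
        raise ValueError(
            f"global batch {global_batch} is not divisible by "
            f"{per_gpu_batch} x {num_gpus} GPUs"
        )
    return global_batch // per_step


if __name__ == "__main__":
    print(grad_accum_steps(256, 4, "0,1"))  # pretraining on two GPUs
    print(grad_accum_steps(128, 4, "0,1"))  # finetuning on two GPUs
```

With the values in the script (two GPUs, per-GPU batch size 4) this gives 32 and 16, matching the settings above. On a single GPU the same targets would need 64 and 32 steps.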

Thanks to the Datawhale summer camp notes: (4) Multimodal Large Models (notion.site)

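Synthesizing the 400k mapped samples will take until tomorrow morning. In the meantime I plan to take a close look at exactly how the data is stored on the compute cloud and pick suitable operators.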

In the meantime I can also test whether the model performs better once the data volume grows dramatically from 400(?).

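TAT I deleted this instance, just like the first time, when I downloaded the full data to the system disk and then deleted the first instance. Running code is like my life: I am not the kind of person who can look at the mistakes and the work already put in, think "better than nothing", and patch things up. Whenever something is not to my liking, I can only start over from scratch; otherwise I am just wasting time, the money spent keeping the machine on (and the food I have eaten)...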