UCoFiA, GitHub, Sunlogin (向日葵), and SSH: a review of errors encountered

lisky_HF

Last modified: 2024-02-25 20:12:08

First published: 2024-01-24 20:34:02

Article link: https://blog.csdn.net/m0_74231873/article/details/135830921

Environment setup

conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=10.2 -c pytorch
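
The cudatoolkit=10.2 build installed above matters for the errors recorded later: the warning in the log below reports that this PyTorch install supports only sm_37 sm_50 sm_60 sm_70, while the RTX 3090 is sm_86. As a rough sketch of how to check this before launching training, the helper below compares a device capability against a list of compiled architectures by major version. The name `is_capability_supported` is hypothetical, and the major-version comparison is a simplification rather than PyTorch's own code; `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()` are the real calls that supply the inputs.

```python
from typing import Iterable, Tuple


def is_capability_supported(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """Return True if any compiled sm_XY arch shares the device's major version.

    Hypothetical helper: a simplified major-version comparison,
    not PyTorch's internal check.
    """
    major, _minor = capability
    # Keep only "sm_XY" entries; "compute_XY" (PTX) entries are ignored here.
    supported_sm = [int(arch.split("_")[1]) for arch in arch_list if arch.startswith("sm_")]
    return any(sm // 10 == major for sm in supported_sm)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print(cap, archs, is_capability_supported(cap, archs))
```

With the values from the log, capability (8, 6) against sm_37 sm_50 sm_60 sm_70 is reported as unsupported. A build against a CUDA 11.x toolkit (for this PyTorch version, cudatoolkit=11.3 is one option) is the usual way to get sm_86 support; the right choice depends on the installed driver, so check the install page linked in the warning.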

train

python train.py -mode predcls -datasize large -data_path $DATAPATH -rel_mem_compute joint -rel_mem_weight_type simple -mem_fusion late -mem_feat_selection manual -mem_feat_lambda 0.5 -rel_head gmm -obj_head linear -K 6 -lr 1e-5 -save_path output/

evaluation

python test.py -mode predcls -datasize large -data_path $DATAPATH -model_path $MODELPATH -rel_mem_compute joint -rel_mem_weight_type simple -mem_fusion late -mem_feat_selection manual -mem_feat_lambda 0.5 -rel_head gmm -obj_head linear -K 6

Additional setup

wget -P ./modules https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
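
The long hexadecimal segment in this URL is what the OpenAI CLIP download code treats as the expected SHA-256 of the file, so it can be used to confirm that the wget download finished intact. A minimal sketch, assuming the file was saved to ./modules/ViT-B-32.pt as in the command above; the function names are hypothetical and not part of UCoFiA or CLIP.

```python
import hashlib
from pathlib import Path

CLIP_URL = (
    "https://openaipublic.azureedge.net/clip/models/"
    "40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt"
)


def expected_sha256_from_url(url: str) -> str:
    """Take the second-to-last path segment of the URL as the expected digest."""
    return url.split("/")[-2]


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so a large checkpoint is never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checkpoint_is_intact(path: Path, url: str = CLIP_URL) -> bool:
    """True only if the file exists and its digest matches the one in the URL."""
    return path.is_file() and sha256_of_file(path) == expected_sha256_from_url(url)
```

Calling `checkpoint_is_intact(Path("modules/ViT-B-32.pt"))` before training gives an early signal if the download was truncated.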

Dataset download links

VATEX: https://eric-xw.github.io/vatex-website/download.html

MSRVTT: https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msrvtt_data.zip

MSVD: https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msvd_data.zip

Activity-Net: http://activity-net.org/download.html

DiDeMo: https://drive.google.com/drive/u/0/folders/1_oyJ5rQiZboipbMl6tkhY8v0s9zDkvJc

wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip

Errors

First attempt

(ucofia) dell@dell-Precision-7820-Tower:/media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/train$ sh scripts/train_msrvtt.sh
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/cuda/__init__.py:146: UserWarning:
NVIDIA GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/cuda/__init__.py:146: UserWarning:
NVIDIA GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
01/28/2024 22:08:43 - INFO -   Effective parameters:
01/28/2024 22:08:43 - INFO -   Effective parameters:
01/28/2024 22:08:43 - INFO -     <<< batch_size: 128
01/28/2024 22:08:43 - INFO -     <<< batch_size: 128
01/28/2024 22:08:43 - INFO -     <<< batch_size_val: 32
01/28/2024 22:08:43 - INFO -     <<< batch_size_val: 32
01/28/2024 22:08:43 - INFO -     <<< cache_dir:
01/28/2024 22:08:43 - INFO -     <<< cache_dir:
01/28/2024 22:08:43 - INFO -     <<< coef_lr: 0.001
01/28/2024 22:08:43 - INFO -     <<< coef_lr: 0.001
01/28/2024 22:08:43 - INFO -     <<< cross_model: cross-base
01/28/2024 22:08:43 - INFO -     <<< cross_model: cross-base
01/28/2024 22:08:43 - INFO -     <<< cross_num_hidden_layers: 4
01/28/2024 22:08:43 - INFO -     <<< cross_num_hidden_layers: 4
01/28/2024 22:08:43 - INFO -     <<< data_path: /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/data/MSRVTT/msrvtt_data/MSRVTT_data.json
01/28/2024 22:08:43 - INFO -     <<< data_path: /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/data/MSRVTT/msrvtt_data/MSRVTT_data.json
01/28/2024 22:08:43 - INFO -     <<< datatype: msrvtt
01/28/2024 22:08:43 - INFO -     <<< do_eval: False
01/28/2024 22:08:43 - INFO -     <<< datatype: msrvtt
01/28/2024 22:08:43 - INFO -     <<< do_lower_case: False
01/28/2024 22:08:43 - INFO -     <<< do_eval: False
01/28/2024 22:08:43 - INFO -     <<< do_pretrain: False
01/28/2024 22:08:43 - INFO -     <<< do_lower_case: False
01/28/2024 22:08:43 - INFO -     <<< do_train: True
01/28/2024 22:08:43 - INFO -     <<< do_pretrain: False
01/28/2024 22:08:43 - INFO -     <<< epochs: 15
01/28/2024 22:08:43 - INFO -     <<< do_train: True
01/28/2024 22:08:43 - INFO -     <<< eval_frame_order: 0
01/28/2024 22:08:43 - INFO -     <<< epochs: 15
01/28/2024 22:08:43 - INFO -     <<< eval_frame_order: 0
01/28/2024 22:08:43 - INFO -     <<< expand_msrvtt_sentences: True
01/28/2024 22:08:43 - INFO -     <<< expand_msrvtt_sentences: True
01/28/2024 22:08:43 - INFO -     <<< feature_framerate: 1
01/28/2024 22:08:43 - INFO -     <<< feature_framerate: 1
01/28/2024 22:08:43 - INFO -     <<< features_path: /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/data/MSRVTT/videos/all
01/28/2024 22:08:43 - INFO -     <<< features_path: /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/data/MSRVTT/videos/all
01/28/2024 22:08:43 - INFO -     <<< fp16: False
01/28/2024 22:08:43 - INFO -     <<< fp16: False
01/28/2024 22:08:43 - INFO -     <<< fp16_opt_level: O1
01/28/2024 22:08:43 - INFO -     <<< fp16_opt_level: O1
01/28/2024 22:08:43 - INFO -     <<< freeze_layer_num: 0
01/28/2024 22:08:43 - INFO -     <<< freeze_layer_num: 0
01/28/2024 22:08:43 - INFO -     <<< gradient_accumulation_steps: 1
01/28/2024 22:08:43 - INFO -     <<< gradient_accumulation_steps: 1
01/28/2024 22:08:43 - INFO -     <<< hard_negative_rate: 0.5
01/28/2024 22:08:43 - INFO -     <<< hard_negative_rate: 0.5
01/28/2024 22:08:43 - INFO -     <<< init_model: None
01/28/2024 22:08:43 - INFO -     <<< init_model: None
01/28/2024 22:08:43 - INFO -     <<< linear_patch: 2d
01/28/2024 22:08:43 - INFO -     <<< linear_patch: 2d
01/28/2024 22:08:43 - INFO -     <<< local_rank: 0
01/28/2024 22:08:43 - INFO -     <<< local_rank: 0
01/28/2024 22:08:43 - INFO -     <<< loose_type: True
01/28/2024 22:08:43 - INFO -     <<< loose_type: True
01/28/2024 22:08:43 - INFO -     <<< lr: 0.0001
01/28/2024 22:08:43 - INFO -     <<< lr: 0.0001
01/28/2024 22:08:43 - INFO -     <<< lr_decay: 0.9
01/28/2024 22:08:43 - INFO -     <<< lr_decay: 0.9
01/28/2024 22:08:43 - INFO -     <<< margin: 0.1
01/28/2024 22:08:43 - INFO -     <<< margin: 0.1
01/28/2024 22:08:43 - INFO -     <<< max_frames: 12
01/28/2024 22:08:43 - INFO -     <<< max_frames: 12
01/28/2024 22:08:43 - INFO -     <<< max_words: 32
01/28/2024 22:08:43 - INFO -     <<< max_words: 32
01/28/2024 22:08:43 - INFO -     <<< n_display: 100
01/28/2024 22:08:43 - INFO -     <<< n_gpu: 1
01/28/2024 22:08:43 - INFO -     <<< n_display: 100
01/28/2024 22:08:43 - INFO -     <<< n_pair: 1
01/28/2024 22:08:43 - INFO -     <<< n_gpu: 1
01/28/2024 22:08:43 - INFO -     <<< negative_weighting: 1
01/28/2024 22:08:43 - INFO -     <<< n_pair: 1
01/28/2024 22:08:43 - INFO -     <<< num_thread_reader: 8
01/28/2024 22:08:43 - INFO -     <<< negative_weighting: 1
01/28/2024 22:08:43 - INFO -     <<< output_dir: [/home/dell/imuse_videoUnderstanding/litianqi/UCoFiA-main/output]
01/28/2024 22:08:43 - INFO -     <<< num_thread_reader: 8
01/28/2024 22:08:43 - INFO -     <<< pretrained_clip_name: ViT-B/32
01/28/2024 22:08:43 - INFO -     <<< output_dir: [/home/dell/imuse_videoUnderstanding/litianqi/UCoFiA-main/output]
01/28/2024 22:08:43 - INFO -     <<< rank: 2
01/28/2024 22:08:43 - INFO -     <<< pretrained_clip_name: ViT-B/32
01/28/2024 22:08:43 - INFO -     <<< resume_model: None
01/28/2024 22:08:43 - INFO -     <<< rank: 3
01/28/2024 22:08:43 - INFO -     <<< sampled_use_mil: False
01/28/2024 22:08:43 - INFO -     <<< resume_model: None
01/28/2024 22:08:43 - INFO -     <<< seed: 42
01/28/2024 22:08:43 - INFO -     <<< sampled_use_mil: False
01/28/2024 22:08:43 - INFO -     <<< sim_header: seqTransf
01/28/2024 22:08:43 - INFO -     <<< seed: 42
01/28/2024 22:08:43 - INFO -     <<< slice_framepos: 2
01/28/2024 22:08:43 - INFO -     <<< sim_header: seqTransf
01/28/2024 22:08:43 - INFO -     <<< task_type: retrieval
01/28/2024 22:08:43 - INFO -     <<< slice_framepos: 2
01/28/2024 22:08:43 - INFO -     <<< text_num_hidden_layers: 12
01/28/2024 22:08:43 - INFO -     <<< task_type: retrieval
01/28/2024 22:08:43 - INFO -     <<< train_csv: /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/data/MSRVTT/msrvtt_data/MSRVTT_train.9k.csv
01/28/2024 22:08:43 - INFO -     <<< text_num_hidden_layers: 12
01/28/2024 22:08:43 - INFO -     <<< train_frame_order: 0
01/28/2024 22:08:43 - INFO -     <<< train_csv: /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/data/MSRVTT/msrvtt_data/MSRVTT_train.9k.csv
01/28/2024 22:08:43 - INFO -     <<< use_mil: False
01/28/2024 22:08:43 - INFO -     <<< train_frame_order: 0
01/28/2024 22:08:43 - INFO -     <<< val_csv: /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/data/MSRVTT/msrvtt_data/MSRVTT_JSFUSION_test.csv
01/28/2024 22:08:43 - INFO -     <<< use_mil: False
01/28/2024 22:08:43 - INFO -     <<< video_dim: 1024
01/28/2024 22:08:43 - INFO -     <<< val_csv: /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/data/MSRVTT/msrvtt_data/MSRVTT_JSFUSION_test.csv
01/28/2024 22:08:43 - INFO -     <<< visual_num_hidden_layers: 12
01/28/2024 22:08:43 - INFO -     <<< video_dim: 1024
01/28/2024 22:08:43 - INFO -     <<< warmup_proportion: 0.1
01/28/2024 22:08:43 - INFO -     <<< visual_num_hidden_layers: 12
01/28/2024 22:08:43 - INFO -     <<< world_size: 4
01/28/2024 22:08:43 - INFO -     <<< warmup_proportion: 0.1
01/28/2024 22:08:43 - INFO -     <<< world_size: 4
01/28/2024 22:08:43 - INFO -   device: cuda:0 n_gpu: 1
01/28/2024 22:08:43 - INFO -   device: cuda:0 n_gpu: 1
/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/cuda/__init__.py:146: UserWarning:
NVIDIA GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

01/28/2024 22:08:45 - INFO -   Weight doesn't exsits. /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/train/modules/cross-base/cross_pytorch_model.bin
01/28/2024 22:08:45 - WARNING -   Stage-One:True, Stage-Two:False
01/28/2024 22:08:45 - WARNING -   Test retrieval by loose type.
01/28/2024 22:08:45 - WARNING -     embed_dim: 512
01/28/2024 22:08:45 - WARNING -     image_resolution: 224
01/28/2024 22:08:45 - WARNING -     vision_layers: 12
01/28/2024 22:08:45 - WARNING -     vision_width: 768
01/28/2024 22:08:45 - WARNING -     vision_patch_size: 32
01/28/2024 22:08:45 - WARNING -     context_length: 77
01/28/2024 22:08:45 - WARNING -     vocab_size: 49408
01/28/2024 22:08:45 - WARNING -     transformer_width: 512
01/28/2024 22:08:45 - WARNING -     transformer_heads: 8
01/28/2024 22:08:45 - WARNING -     transformer_layers: 12
01/28/2024 22:08:45 - WARNING -         linear_patch: 2d
01/28/2024 22:08:45 - WARNING -     cut_top_layer: 0
01/28/2024 22:08:46 - WARNING -     sim_header: seqTransf
01/28/2024 22:08:47 - WARNING -     sim_header: seqTransf
01/28/2024 22:08:47 - WARNING -     sim_header: seqTransf
01/28/2024 22:08:47 - WARNING -     sim_header: seqTransf
01/28/2024 22:08:53 - INFO -   --------------------
01/28/2024 22:08:53 - INFO -   Weights of UCoFiA not initialized from pretrained model:
   global_mat_weight
   global_mat_weight_1
   word_logit_weight
   frame_logit_weight
   local_mat_weight
   local_mat_weight1
   frame_mat_weight
   word_mat_weight
   frame_mat_weight2
   word_mat_weight2
   pixel_mat_weight
   pixel_mat_weight2
   word_mat_weight_for_pixel
   visual_token_selector.score_predictor.in_conv.0.weight
   visual_token_selector.score_predictor.in_conv.0.bias
   visual_token_selector.score_predictor.in_conv.1.weight
   visual_token_selector.score_predictor.out_conv.0.weight
   visual_token_selector.score_predictor.out_conv.2.weight
01/28/2024 22:08:53 - INFO -   Weights from pretrained model not used in UCoFiA:
   clip.input_resolution
   clip.context_length
   clip.vocab_size
01/28/2024 22:08:53 - WARNING -   Using patch shift!
01/28/2024 22:08:53 - WARNING -   Using patch shift!
01/28/2024 22:08:53 - INFO -   --------------------
01/28/2024 22:08:53 - INFO -   Weights of UCoFiA not initialized from pretrained model:
   global_mat_weight
   global_mat_weight_1
   word_logit_weight
   frame_logit_weight
   local_mat_weight
   local_mat_weight1
   frame_mat_weight
   word_mat_weight
   frame_mat_weight2
   word_mat_weight2
   pixel_mat_weight
   pixel_mat_weight2
   word_mat_weight_for_pixel
   visual_token_selector.score_predictor.in_conv.0.weight
   visual_token_selector.score_predictor.in_conv.0.bias
   visual_token_selector.score_predictor.in_conv.1.weight
   visual_token_selector.score_predictor.out_conv.0.weight
   visual_token_selector.score_predictor.out_conv.2.weight
01/28/2024 22:08:53 - INFO -   Weights from pretrained model not used in UCoFiA:
   clip.input_resolution
   clip.context_length
   clip.vocab_size
01/28/2024 22:08:53 - WARNING -   Using patch shift!
01/28/2024 22:08:53 - WARNING -   Using patch shift!
01/28/2024 22:08:54 - INFO -   --------------------
01/28/2024 22:08:54 - INFO -   Weights of UCoFiA not initialized from pretrained model:
   global_mat_weight
   global_mat_weight_1
   word_logit_weight
   frame_logit_weight
   local_mat_weight
   local_mat_weight1
   frame_mat_weight
   word_mat_weight
   frame_mat_weight2
   word_mat_weight2
   pixel_mat_weight
   pixel_mat_weight2
   word_mat_weight_for_pixel
   visual_token_selector.score_predictor.in_conv.0.weight
   visual_token_selector.score_predictor.in_conv.0.bias
   visual_token_selector.score_predictor.in_conv.1.weight
   visual_token_selector.score_predictor.out_conv.0.weight
   visual_token_selector.score_predictor.out_conv.2.weight
01/28/2024 22:08:54 - INFO -   Weights from pretrained model not used in UCoFiA:
   clip.input_resolution
   clip.context_length
   clip.vocab_size
01/28/2024 22:08:54 - WARNING -   Using patch shift!
01/28/2024 22:08:54 - WARNING -   Using patch shift!
01/28/2024 22:08:54 - INFO -   --------------------
01/28/2024 22:08:54 - INFO -   Weights of UCoFiA not initialized from pretrained model:
   global_mat_weight
   global_mat_weight_1
   word_logit_weight
   frame_logit_weight
   local_mat_weight
   local_mat_weight1
   frame_mat_weight
   word_mat_weight
   frame_mat_weight2
   word_mat_weight2
   pixel_mat_weight
   pixel_mat_weight2
   word_mat_weight_for_pixel
   visual_token_selector.score_predictor.in_conv.0.weight
   visual_token_selector.score_predictor.in_conv.0.bias
   visual_token_selector.score_predictor.in_conv.1.weight
   visual_token_selector.score_predictor.out_conv.0.weight
   visual_token_selector.score_predictor.out_conv.2.weight
01/28/2024 22:08:54 - INFO -   Weights from pretrained model not used in UCoFiA:
   clip.input_resolution
   clip.context_length
   clip.vocab_size
01/28/2024 22:08:54 - WARNING -   Using patch shift!
01/28/2024 22:08:54 - WARNING -   Using patch shift!
/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torchvision/transforms/transforms.py:332: UserWarning: Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. Please use InterpolationMode enum.
warnings.warn(
01/28/2024 22:08:54 - INFO -   ***** Running test *****
01/28/2024 22:08:54 - INFO -     Num examples = 1000
01/28/2024 22:08:54 - INFO -     Batch size = 32
01/28/2024 22:08:54 - INFO -     Num steps = 32
01/28/2024 22:08:54 - INFO -   ***** Running val *****
01/28/2024 22:08:54 - INFO -     Num examples = 1000
/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torchvision/transforms/transforms.py:332: UserWarning: Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. Please use InterpolationMode enum.
warnings.warn(
01/28/2024 22:08:54 - INFO -   ***** Running test *****
01/28/2024 22:08:54 - INFO -     Num examples = 1000
01/28/2024 22:08:54 - INFO -     Batch size = 32
01/28/2024 22:08:54 - INFO -     Num steps = 32
01/28/2024 22:08:54 - INFO -   ***** Running val *****
01/28/2024 22:08:54 - INFO -     Num examples = 1000
/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torchvision/transforms/transforms.py:332: UserWarning: Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. Please use InterpolationMode enum.
warnings.warn(
01/28/2024 22:08:54 - INFO -   ***** Running test *****
01/28/2024 22:08:54 - INFO -     Num examples = 1000
01/28/2024 22:08:54 - INFO -     Batch size = 32
01/28/2024 22:08:54 - INFO -     Num steps = 32
01/28/2024 22:08:54 - INFO -   ***** Running val *****
01/28/2024 22:08:54 - INFO -     Num examples = 1000
/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torchvision/transforms/transforms.py:332: UserWarning: Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. Please use InterpolationMode enum.
warnings.warn(
01/28/2024 22:08:54 - INFO -   ***** Running test *****
01/28/2024 22:08:54 - INFO -     Num examples = 1000
01/28/2024 22:08:54 - INFO -     Batch size = 32
01/28/2024 22:08:54 - INFO -     Num steps = 32
01/28/2024 22:08:54 - INFO -   ***** Running val *****
01/28/2024 22:08:54 - INFO -     Num examples = 1000
/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torchvision/transforms/transforms.py:332: UserWarning: Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. Please use InterpolationMode enum.
warnings.warn(
/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torchvision/transforms/transforms.py:332: UserWarning: Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. Please use InterpolationMode enum.
warnings.warn(
/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torchvision/transforms/transforms.py:332: UserWarning: Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. Please use InterpolationMode enum.
warnings.warn(
/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torchvision/transforms/transforms.py:332: UserWarning: Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. Please use InterpolationMode enum.
warnings.warn(
Traceback (most recent call last):
File "main_ucofia.py", line 554, in <module>
    main()
File "main_ucofia.py", line 504, in main
    optimizer, scheduler, model = prep_optimizer(args, model, num_train_optimization_steps, device, n_gpu, args.local_rank, coef_lr=coef_lr)
File "main_ucofia.py", line 214, in prep_optimizer
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank],
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 646, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3510273) of binary: /home/dell/anaconda3/envs/ucofia/bin/python
Traceback (most recent call last):
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/distributed/run.py", line 765, in <module>
    main()
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main_ucofia.py FAILED
------------------------------------------------------------
Failures:
[1]:
time      : 2024-01-28_22:09:12
host      : dell-Precision-7820-Tower
rank      : 1 (local_rank: 1)
exitcode : 1 (pid: 3510274)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time      : 2024-01-28_22:09:12
host      : dell-Precision-7820-Tower
rank      : 2 (local_rank: 2)
exitcode : 1 (pid: 3510275)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time      : 2024-01-28_22:09:12
host      : dell-Precision-7820-Tower
rank      : 3 (local_rank: 3)
exitcode : 1 (pid: 3510276)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time      : 2024-01-28_22:09:12
host      : dell-Precision-7820-Tower
rank      : 0 (local_rank: 0)
exitcode : 1 (pid: 3510273)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
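
One detail in the log above is worth a closer look: world_size is 4, yet the parameter dumps for rank 2 and rank 3 both print local_rank: 0 and device: cuda:0. If every process really binds to the same GPU, NCCL can report invalid usage when DistributedDataParallel is constructed, which would match the traceback. This is a hypothesis, not something the log proves. A minimal sketch of resolving the local rank from the LOCAL_RANK environment variable that torch.distributed.run sets for each worker, with a command-line value as fallback; `resolve_local_rank` and `device_for_rank` are hypothetical helpers, not part of main_ucofia.py.

```python
import os
from typing import Mapping, Optional


def resolve_local_rank(cli_local_rank: Optional[int] = None,
                       env: Optional[Mapping[str, str]] = None) -> int:
    """Prefer LOCAL_RANK from the environment, then the CLI value, then 0."""
    env = os.environ if env is None else env
    value = env.get("LOCAL_RANK")
    if value is not None:
        return int(value)
    if cli_local_rank is not None:
        return cli_local_rank
    return 0


def device_for_rank(local_rank: int) -> str:
    """Map each local rank to its own GPU so no two workers share a device."""
    return f"cuda:{local_rank}"
```

In a training script, the resolved value would be passed to `torch.cuda.set_device(local_rank)` before the process group is initialised, and used for `device_ids=[local_rank]` when wrapping the model.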

Add a few exports at the top of the script:

export CUDA_DEVICE_ORDER="PCI_BUS_ID"
export NCCL_DEBUG=info
export NCCL_SOCKET_IFNAME=wlp23s0
export NCCL_IB_DISABLE=1
export OMP_NUM_THREADS=1
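
The same settings can also be applied from Python before the process group is created, which makes it easy to check first that the interface named in NCCL_SOCKET_IFNAME exists on the machine (wlp23s0 is specific to the machine used here). A minimal sketch; `nccl_debug_env` and `available_interfaces` are hypothetical helper names, and `socket.if_nameindex()` is the standard-library call used to list interfaces on Linux.

```python
import os
import socket
from typing import Dict, List


def nccl_debug_env(ifname: str) -> Dict[str, str]:
    """Build the same variables as the export lines, for a given interface."""
    return {
        "CUDA_DEVICE_ORDER": "PCI_BUS_ID",
        "NCCL_DEBUG": "info",
        "NCCL_SOCKET_IFNAME": ifname,
        "NCCL_IB_DISABLE": "1",
        "OMP_NUM_THREADS": "1",
    }


def available_interfaces() -> List[str]:
    """List the network interface names known to the OS."""
    return [name for _index, name in socket.if_nameindex()]
```

A typical use is to confirm `"wlp23s0" in available_interfaces()` and then call `os.environ.update(nccl_debug_env("wlp23s0"))` at the very start of the training script, before any torch.distributed call.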

Then try changing the port a few times.
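
Rather than guessing ports by hand, the operating system can be asked for one that is currently free. A minimal sketch, assuming a single-machine run; `find_free_port` is a hypothetical helper, and the port could in principle be taken by another process between this call and the launch.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Bind to port 0 so the OS assigns an unused port, then report it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    print(find_free_port())
```

The printed value can then be passed as the --master_port argument of the launcher used in the training script.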

2024-01-29

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
01/29/2024 20:56:09 - INFO -   Effective parameters:
01/29/2024 20:56:09 - INFO -     <<< batch_size: 128
01/29/2024 20:56:09 - INFO -     <<< batch_size_val: 32
01/29/2024 20:56:09 - INFO -     <<< cache_dir:
01/29/2024 20:56:09 - INFO -     <<< coef_lr: 0.001
01/29/2024 20:56:09 - INFO -     <<< cross_model: cross-base
01/29/2024 20:56:09 - INFO -     <<< cross_num_hidden_layers: 4
01/29/2024 20:56:09 - INFO -     <<< data_path: /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/data/MSRVTT/msrvtt_data/MSRVTT_data.json
01/29/2024 20:56:09 - INFO -     <<< datatype: msrvtt
01/29/2024 20:56:09 - INFO -     <<< do_eval: False
01/29/2024 20:56:09 - INFO -     <<< do_lower_case: False
01/29/2024 20:56:09 - INFO -     <<< do_pretrain: False
01/29/2024 20:56:09 - INFO -     <<< do_train: True
01/29/2024 20:56:09 - INFO -     <<< epochs: 15
01/29/2024 20:56:09 - INFO -     <<< eval_frame_order: 0
01/29/2024 20:56:09 - INFO -     <<< expand_msrvtt_sentences: True
01/29/2024 20:56:09 - INFO -     <<< feature_framerate: 1
01/29/2024 20:56:09 - INFO -     <<< features_path: /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/data/MSRVTT/videos/all
01/29/2024 20:56:09 - INFO -     <<< fp16: False
01/29/2024 20:56:09 - INFO -     <<< fp16_opt_level: O1
01/29/2024 20:56:09 - INFO -     <<< freeze_layer_num: 0
01/29/2024 20:56:09 - INFO -     <<< gradient_accumulation_steps: 1
01/29/2024 20:56:09 - INFO -     <<< hard_negative_rate: 0.5
01/29/2024 20:56:09 - INFO -     <<< init_model: None
01/29/2024 20:56:09 - INFO -     <<< linear_patch: 2d
01/29/2024 20:56:09 - INFO -     <<< local_rank: 0
01/29/2024 20:56:09 - INFO -     <<< loose_type: True
01/29/2024 20:56:09 - INFO -     <<< lr: 0.0001
01/29/2024 20:56:09 - INFO -     <<< lr_decay: 0.9
01/29/2024 20:56:09 - INFO -     <<< margin: 0.1
01/29/2024 20:56:09 - INFO -     <<< max_frames: 12
01/29/2024 20:56:09 - INFO -     <<< max_words: 32
01/29/2024 20:56:09 - INFO -     <<< n_display: 100
01/29/2024 20:56:09 - INFO -     <<< n_gpu: 1
01/29/2024 20:56:09 - INFO -     <<< n_pair: 1
01/29/2024 20:56:09 - INFO -     <<< negative_weighting: 1
01/29/2024 20:56:09 - INFO -     <<< num_thread_reader: 8
01/29/2024 20:56:09 - INFO -     <<< output_dir: [/home/dell/imuse_videoUnderstanding/litianqi/UCoFiA-main/output]
01/29/2024 20:56:09 - INFO -     <<< pretrained_clip_name: ViT-B/32
01/29/2024 20:56:09 - INFO -     <<< rank: 3
01/29/2024 20:56:09 - INFO -     <<< resume_model: None
01/29/2024 20:56:09 - INFO -     <<< sampled_use_mil: False
01/29/2024 20:56:09 - INFO -     <<< seed: 42
01/29/2024 20:56:09 - INFO -     <<< sim_header: seqTransf
01/29/2024 20:56:09 - INFO -     <<< slice_framepos: 2
01/29/2024 20:56:09 - INFO -     <<< task_type: retrieval
01/29/2024 20:56:09 - INFO -     <<< text_num_hidden_layers: 12
01/29/2024 20:56:09 - INFO -     <<< train_csv: /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/data/MSRVTT/msrvtt_data/MSRVTT_train.9k.csv
01/29/2024 20:56:09 - INFO -     <<< train_frame_order: 0
01/29/2024 20:56:09 - INFO -     <<< use_mil: False
01/29/2024 20:56:09 - INFO -     <<< val_csv: /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/data/MSRVTT/msrvtt_data/MSRVTT_JSFUSION_test.csv
01/29/2024 20:56:09 - INFO -     <<< video_dim: 1024
01/29/2024 20:56:09 - INFO -     <<< visual_num_hidden_layers: 12
01/29/2024 20:56:09 - INFO -     <<< warmup_proportion: 0.1
01/29/2024 20:56:09 - INFO -     <<< world_size: 4
01/29/2024 20:56:09 - INFO -   device: cuda:0 n_gpu: 1
/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/cuda/__init__.py:146: UserWarning:
NVIDIA GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
01/29/2024 20:56:09 - INFO -   Effective parameters:
01/29/2024 20:56:09 - INFO -   Effective parameters:
01/29/2024 20:56:09 - INFO -     <<< batch_size: 128
01/29/2024 20:56:09 - INFO -     <<< batch_size_val: 32
01/29/2024 20:56:09 - INFO -     <<< cache_dir:
01/29/2024 20:56:09 - INFO -     <<< batch_size: 128
01/29/2024 20:56:09 - INFO -     <<< coef_lr: 0.001
01/29/2024 20:56:09 - INFO -     <<< batch_size_val: 32
01/29/2024 20:56:09 - INFO -     <<< cross_model: cross-base
01/29/2024 20:56:09 - INFO -     <<< cross_num_hidden_layers: 4
01/29/2024 20:56:09 - INFO -     <<< cache_dir:
01/29/2024 20:56:09 - INFO -     <<< data_path: /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/data/MSRVTT/msrvtt_data/MSRVTT_data.json
01/29/2024 20:56:09 - INFO -     <<< coef_lr: 0.001
01/29/2024 20:56:09 - INFO -     <<< datatype: msrvtt
01/29/2024 20:56:09 - INFO -     <<< cross_model: cross-base
01/29/2024 20:56:09 - INFO -     <<< do_eval: False
01/29/2024 20:56:09 - INFO -     <<< cross_num_hidden_layers: 4
01/29/2024 20:56:09 - INFO -     <<< do_lower_case: False
01/29/2024 20:56:09 - INFO -     <<< do_pretrain: False
01/29/2024 20:56:09 - INFO -     <<< data_path: /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/data/MSRVTT/msrvtt_data/MSRVTT_data.json
01/29/2024 20:56:09 - INFO -     <<< do_train: True
01/29/2024 20:56:09 - INFO -     <<< datatype: msrvtt
01/29/2024 20:56:09 - INFO -     <<< epochs: 15
01/29/2024 20:56:09 - INFO -     <<< do_eval: False
01/29/2024 20:56:09 - INFO -     <<< eval_frame_order: 0
01/29/2024 20:56:09 - INFO -     <<< do_lower_case: False
01/29/2024 20:56:09 - INFO -     <<< expand_msrvtt_sentences: True
01/29/2024 20:56:09 - INFO -     <<< do_pretrain: False
01/29/2024 20:56:09 - INFO -     <<< feature_framerate: 1
01/29/2024 20:56:09 - INFO -     <<< features_path: /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/data/MSRVTT/videos/all
01/29/2024 20:56:09 - INFO -     <<< do_train: True
01/29/2024 20:56:09 - INFO -     <<< fp16: False
01/29/2024 20:56:09 - INFO -     <<< epochs: 15
01/29/2024 20:56:09 - INFO -     <<< fp16_opt_level: O1
01/29/2024 20:56:09 - INFO -     <<< eval_frame_order: 0
01/29/2024 20:56:09 - INFO -     <<< freeze_layer_num: 0
01/29/2024 20:56:09 - INFO -     <<< expand_msrvtt_sentences: True
01/29/2024 20:56:09 - INFO -     <<< gradient_accumulation_steps: 1
01/29/2024 20:56:09 - INFO -     <<< feature_framerate: 1
01/29/2024 20:56:09 - INFO -     <<< hard_negative_rate: 0.5
01/29/2024 20:56:09 - INFO -     <<< init_model: None
01/29/2024 20:56:09 - INFO -     <<< features_path: /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/data/MSRVTT/videos/all
01/29/2024 20:56:09 - INFO -     <<< linear_patch: 2d
01/29/2024 20:56:09 - INFO -     <<< fp16: False
01/29/2024 20:56:09 - INFO -     <<< local_rank: 0
01/29/2024 20:56:09 - INFO -     <<< fp16_opt_level: O1
01/29/2024 20:56:09 - INFO -     <<< loose_type: True
01/29/2024 20:56:09 - INFO -     <<< freeze_layer_num: 0
01/29/2024 20:56:09 - INFO -     <<< lr: 0.0001
01/29/2024 20:56:09 - INFO -     <<< lr_decay: 0.9
01/29/2024 20:56:09 - INFO -     <<< gradient_accumulation_steps: 1
01/29/2024 20:56:09 - INFO -     <<< margin: 0.1
01/29/2024 20:56:09 - INFO -     <<< hard_negative_rate: 0.5
01/29/2024 20:56:09 - INFO -     <<< max_frames: 12
01/29/2024 20:56:09 - INFO -     <<< init_model: None
01/29/2024 20:56:09 - INFO -     <<< max_words: 32
01/29/2024 20:56:09 - INFO -     <<< linear_patch: 2d
01/29/2024 20:56:09 - INFO -     <<< n_display: 100
/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/cuda/__init__.py:146: UserWarning:
NVIDIA GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
01/29/2024 20:56:09 - INFO -     <<< local_rank: 0
01/29/2024 20:56:09 - INFO -     <<< n_gpu: 1
01/29/2024 20:56:09 - INFO -     <<< n_pair: 1
01/29/2024 20:56:09 - INFO -     <<< loose_type: True
01/29/2024 20:56:09 - INFO -     <<< negative_weighting: 1
01/29/2024 20:56:09 - INFO -     <<< lr: 0.0001
01/29/2024 20:56:09 - INFO -     <<< num_thread_reader: 8
01/29/2024 20:56:09 - INFO -     <<< lr_decay: 0.9
01/29/2024 20:56:09 - INFO -     <<< output_dir: [/home/dell/imuse_videoUnderstanding/litianqi/UCoFiA-main/output]
01/29/2024 20:56:09 - INFO -     <<< margin: 0.1
01/29/2024 20:56:09 - INFO -     <<< pretrained_clip_name: ViT-B/32
01/29/2024 20:56:09 - INFO -     <<< rank: 2
01/29/2024 20:56:09 - INFO -     <<< max_frames: 12
01/29/2024 20:56:09 - INFO -     <<< resume_model: None
01/29/2024 20:56:09 - INFO -     <<< max_words: 32
01/29/2024 20:56:09 - INFO -     <<< sampled_use_mil: False
01/29/2024 20:56:09 - INFO -     <<< n_display: 100
01/29/2024 20:56:09 - INFO -     <<< seed: 42
01/29/2024 20:56:09 - INFO -     <<< n_gpu: 1
01/29/2024 20:56:09 - INFO -     <<< sim_header: seqTransf
01/29/2024 20:56:09 - INFO -     <<< n_pair: 1
01/29/2024 20:56:09 - INFO -   Effective parameters:
01/29/2024 20:56:09 - INFO -     <<< slice_framepos: 2
01/29/2024 20:56:09 - INFO -     <<< task_type: retrieval
01/29/2024 20:56:09 - INFO -     <<< negative_weighting: 1
01/29/2024 20:56:09 - INFO -     <<< text_num_hidden_layers: 12
01/29/2024 20:56:09 - INFO -     <<< num_thread_reader: 8
01/29/2024 20:56:09 - INFO -     <<< train_csv: /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/data/MSRVTT/msrvtt_data/MSRVTT_train.9k.csv
01/29/2024 20:56:09 - INFO -     <<< batch_size: 128
01/29/2024 20:56:09 - INFO -     <<< output_dir: [/home/dell/imuse_videoUnderstanding/litianqi/UCoFiA-main/output]
01/29/2024 20:56:09 - INFO -     <<< train_frame_order: 0
01/29/2024 20:56:09 - INFO -     <<< pretrained_clip_name: ViT-B/32
01/29/2024 20:56:09 - INFO -     <<< use_mil: False
01/29/2024 20:56:09 - INFO -     <<< batch_size_val: 32
01/29/2024 20:56:09 - INFO -     <<< val_csv: /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/data/MSRVTT/msrvtt_data/MSRVTT_JSFUSION_test.csv
01/29/2024 20:56:09 - INFO -     <<< rank: 0
01/29/2024 20:56:09 - INFO -     <<< video_dim: 1024
01/29/2024 20:56:09 - INFO -     <<< cache_dir:
01/29/2024 20:56:09 - INFO -     <<< resume_model: None
01/29/2024 20:56:09 - INFO -     <<< visual_num_hidden_layers: 12
01/29/2024 20:56:09 - INFO -     <<< sampled_use_mil: False
01/29/2024 20:56:09 - INFO -     <<< coef_lr: 0.001
01/29/2024 20:56:09 - INFO -     <<< warmup_proportion: 0.1
01/29/2024 20:56:09 - INFO -     <<< seed: 42
01/29/2024 20:56:09 - INFO -     <<< world_size: 4
01/29/2024 20:56:09 - INFO -     <<< cross_model: cross-base
01/29/2024 20:56:09 - INFO -     <<< sim_header: seqTransf
01/29/2024 20:56:09 - INFO -   device: cuda:0 n_gpu: 1
01/29/2024 20:56:09 - INFO -     <<< cross_num_hidden_layers: 4
01/29/2024 20:56:09 - INFO -     <<< slice_framepos: 2
01/29/2024 20:56:09 - INFO -     <<< task_type: retrieval
01/29/2024 20:56:09 - INFO -     <<< data_path: /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/data/MSRVTT/msrvtt_data/MSRVTT_data.json
01/29/2024 20:56:09 - INFO -     <<< text_num_hidden_layers: 12
01/29/2024 20:56:09 - INFO -     <<< datatype: msrvtt
01/29/2024 20:56:09 - INFO -     <<< train_csv: /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/data/MSRVTT/msrvtt_data/MSRVTT_train.9k.csv
01/29/2024 20:56:09 - INFO -     <<< do_eval: False
01/29/2024 20:56:09 - INFO -     <<< train_frame_order: 0
01/29/2024 20:56:09 - INFO -     <<< do_lower_case: False
01/29/2024 20:56:09 - INFO -     <<< use_mil: False
01/29/2024 20:56:09 - INFO -     <<< val_csv: /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/data/MSRVTT/msrvtt_data/MSRVTT_JSFUSION_test.csv
01/29/2024 20:56:09 - INFO -     <<< do_pretrain: False
01/29/2024 20:56:09 - INFO -     <<< video_dim: 1024
01/29/2024 20:56:09 - INFO -     <<< do_train: True
01/29/2024 20:56:09 - INFO -     <<< visual_num_hidden_layers: 12
01/29/2024 20:56:09 - INFO -     <<< epochs: 15
01/29/2024 20:56:09 - INFO -     <<< warmup_proportion: 0.1
01/29/2024 20:56:09 - INFO -     <<< eval_frame_order: 0
01/29/2024 20:56:09 - INFO -     <<< world_size: 4
01/29/2024 20:56:09 - INFO -     <<< expand_msrvtt_sentences: True
01/29/2024 20:56:09 - INFO -   device: cuda:0 n_gpu: 1
01/29/2024 20:56:09 - INFO -     <<< feature_framerate: 1
01/29/2024 20:56:09 - INFO -     <<< features_path: /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/data/MSRVTT/videos/all
01/29/2024 20:56:09 - INFO -     <<< fp16: False
01/29/2024 20:56:09 - INFO -     <<< fp16_opt_level: O1
01/29/2024 20:56:09 - INFO -     <<< freeze_layer_num: 0
01/29/2024 20:56:09 - INFO -     <<< gradient_accumulation_steps: 1
01/29/2024 20:56:09 - INFO -     <<< hard_negative_rate: 0.5
01/29/2024 20:56:09 - INFO -     <<< init_model: None
01/29/2024 20:56:09 - INFO -     <<< linear_patch: 2d
01/29/2024 20:56:09 - INFO -     <<< local_rank: 0
01/29/2024 20:56:09 - INFO -     <<< loose_type: True
01/29/2024 20:56:09 - INFO -     <<< lr: 0.0001
01/29/2024 20:56:09 - INFO -     <<< lr_decay: 0.9
01/29/2024 20:56:09 - INFO -     <<< margin: 0.1
01/29/2024 20:56:09 - INFO -     <<< max_frames: 12
01/29/2024 20:56:09 - INFO -     <<< max_words: 32
01/29/2024 20:56:09 - INFO -     <<< n_display: 100
01/29/2024 20:56:09 - INFO -     <<< n_gpu: 1
01/29/2024 20:56:09 - INFO -     <<< n_pair: 1
01/29/2024 20:56:09 - INFO -     <<< negative_weighting: 1
01/29/2024 20:56:09 - INFO -     <<< num_thread_reader: 8
01/29/2024 20:56:09 - INFO -     <<< output_dir: [/home/dell/imuse_videoUnderstanding/litianqi/UCoFiA-main/output]
01/29/2024 20:56:09 - INFO -     <<< pretrained_clip_name: ViT-B/32
01/29/2024 20:56:09 - INFO -     <<< rank: 1
01/29/2024 20:56:09 - INFO -     <<< resume_model: None
01/29/2024 20:56:09 - INFO -     <<< sampled_use_mil: False
01/29/2024 20:56:09 - INFO -     <<< seed: 42
01/29/2024 20:56:09 - INFO -     <<< sim_header: seqTransf
01/29/2024 20:56:09 - INFO -     <<< slice_framepos: 2
01/29/2024 20:56:09 - INFO -     <<< task_type: retrieval
01/29/2024 20:56:09 - INFO -     <<< text_num_hidden_layers: 12
01/29/2024 20:56:09 - INFO -     <<< train_csv: /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/data/MSRVTT/msrvtt_data/MSRVTT_train.9k.csv
01/29/2024 20:56:09 - INFO -     <<< train_frame_order: 0
01/29/2024 20:56:09 - INFO -     <<< use_mil: False
01/29/2024 20:56:09 - INFO -     <<< val_csv: /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/data/MSRVTT/msrvtt_data/MSRVTT_JSFUSION_test.csv
01/29/2024 20:56:09 - INFO -     <<< video_dim: 1024
01/29/2024 20:56:09 - INFO -     <<< visual_num_hidden_layers: 12
01/29/2024 20:56:09 - INFO -     <<< warmup_proportion: 0.1
01/29/2024 20:56:09 - INFO -     <<< world_size: 4
01/29/2024 20:56:09 - INFO -   device: cuda:0 n_gpu: 1
01/29/2024 20:56:10 - INFO -   loading archive file /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/train/modules/cross-base
01/29/2024 20:56:10 - INFO -   Model config {
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 512,
"initializer_range": 0.02,
"intermediate_size": 2048,
"max_position_embeddings": 128,
"num_attention_heads": 8,
"num_hidden_layers": 4,
"type_vocab_size": 2,
"vocab_size": 512
}

01/29/2024 20:56:10 - WARNING -         linear_patch: 2d
01/29/2024 20:56:10 - WARNING -     cut_top_layer: 0
01/29/2024 20:56:10 - INFO -   Weight doesn't exsits. /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/train/modules/cross-base/cross_pytorch_model.bin
01/29/2024 20:56:10 - WARNING -   Stage-One:True, Stage-Two:False
01/29/2024 20:56:10 - WARNING -   Test retrieval by loose type.
01/29/2024 20:56:10 - WARNING -     embed_dim: 512
01/29/2024 20:56:10 - WARNING -     image_resolution: 224
01/29/2024 20:56:10 - WARNING -     vision_layers: 12
01/29/2024 20:56:10 - WARNING -     vision_width: 768
01/29/2024 20:56:10 - WARNING -     vision_patch_size: 32
01/29/2024 20:56:10 - WARNING -     context_length: 77
01/29/2024 20:56:10 - WARNING -     vocab_size: 49408
01/29/2024 20:56:10 - WARNING -     transformer_width: 512
01/29/2024 20:56:10 - WARNING -     transformer_heads: 8
01/29/2024 20:56:10 - WARNING -     transformer_layers: 12
01/29/2024 20:56:10 - WARNING -         linear_patch: 2d
01/29/2024 20:56:10 - WARNING -     cut_top_layer: 0
01/29/2024 20:56:10 - INFO -   loading archive file /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/train/modules/cross-base
01/29/2024 20:56:10 - INFO -   Model config {
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 512,
"initializer_range": 0.02,
"intermediate_size": 2048,
"max_position_embeddings": 128,
"num_attention_heads": 8,
"num_hidden_layers": 4,
"type_vocab_size": 2,
"vocab_size": 512
}

01/29/2024 20:56:10 - INFO -   Weight doesn't exsits. /media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/train/modules/cross-base/cross_pytorch_model.bin
01/29/2024 20:56:10 - WARNING -   Stage-One:True, Stage-Two:False
01/29/2024 20:56:10 - WARNING -   Test retrieval by loose type.
01/29/2024 20:56:10 - WARNING -     embed_dim: 512
01/29/2024 20:56:10 - WARNING -     image_resolution: 224
01/29/2024 20:56:10 - WARNING -     vision_layers: 12
01/29/2024 20:56:10 - WARNING -     vision_width: 768
01/29/2024 20:56:10 - WARNING -     vision_patch_size: 32
01/29/2024 20:56:10 - WARNING -     context_length: 77
01/29/2024 20:56:10 - WARNING -     vocab_size: 49408
01/29/2024 20:56:10 - WARNING -     transformer_width: 512
01/29/2024 20:56:10 - WARNING -     transformer_heads: 8
01/29/2024 20:56:10 - WARNING -     transformer_layers: 12
01/29/2024 20:56:10 - WARNING -         linear_patch: 2d
01/29/2024 20:56:10 - WARNING -     cut_top_layer: 0
01/29/2024 20:56:12 - WARNING -     sim_header: seqTransf
01/29/2024 20:56:12 - WARNING -     sim_header: seqTransf
01/29/2024 20:56:12 - WARNING -     sim_header: seqTransf
01/29/2024 20:56:12 - WARNING -     sim_header: seqTransf
01/29/2024 20:56:19 - INFO -   --------------------
01/29/2024 20:56:19 - INFO -   Weights of UCoFiA not initialized from pretrained model:
   global_mat_weight
   global_mat_weight_1
   word_logit_weight
   frame_logit_weight
   local_mat_weight
   local_mat_weight1
   frame_mat_weight
   word_mat_weight
   frame_mat_weight2
   word_mat_weight2
   pixel_mat_weight
   pixel_mat_weight2
   word_mat_weight_for_pixel
   visual_token_selector.score_predictor.in_conv.0.weight
   visual_token_selector.score_predictor.in_conv.0.bias
   visual_token_selector.score_predictor.in_conv.1.weight
   visual_token_selector.score_predictor.out_conv.0.weight
   visual_token_selector.score_predictor.out_conv.2.weight
01/29/2024 20:56:19 - INFO -   Weights from pretrained model not used in UCoFiA:
   clip.input_resolution
   clip.context_length
   clip.vocab_size
01/29/2024 20:56:19 - WARNING -   Using patch shift!
01/29/2024 20:56:19 - WARNING -   Using patch shift!
01/29/2024 20:56:19 - INFO -   --------------------
01/29/2024 20:56:19 - INFO -   Weights of UCoFiA not initialized from pretrained model:
   global_mat_weight
   global_mat_weight_1
   word_logit_weight
   frame_logit_weight
   local_mat_weight
   local_mat_weight1
   frame_mat_weight
   word_mat_weight
   frame_mat_weight2
   word_mat_weight2
   pixel_mat_weight
   pixel_mat_weight2
   word_mat_weight_for_pixel
   visual_token_selector.score_predictor.in_conv.0.weight
   visual_token_selector.score_predictor.in_conv.0.bias
   visual_token_selector.score_predictor.in_conv.1.weight
   visual_token_selector.score_predictor.out_conv.0.weight
   visual_token_selector.score_predictor.out_conv.2.weight
01/29/2024 20:56:19 - INFO -   Weights from pretrained model not used in UCoFiA:
   clip.input_resolution
   clip.context_length
   clip.vocab_size
01/29/2024 20:56:19 - WARNING -   Using patch shift!
01/29/2024 20:56:19 - WARNING -   Using patch shift!
01/29/2024 20:56:19 - INFO -   --------------------
01/29/2024 20:56:19 - INFO -   Weights of UCoFiA not initialized from pretrained model:
   global_mat_weight
   global_mat_weight_1
   word_logit_weight
   frame_logit_weight
   local_mat_weight
   local_mat_weight1
   frame_mat_weight
   word_mat_weight
   frame_mat_weight2
   word_mat_weight2
   pixel_mat_weight
   pixel_mat_weight2
   word_mat_weight_for_pixel
   visual_token_selector.score_predictor.in_conv.0.weight
   visual_token_selector.score_predictor.in_conv.0.bias
   visual_token_selector.score_predictor.in_conv.1.weight
   visual_token_selector.score_predictor.out_conv.0.weight
   visual_token_selector.score_predictor.out_conv.2.weight
01/29/2024 20:56:19 - INFO -   Weights from pretrained model not used in UCoFiA:
   clip.input_resolution
   clip.context_length
   clip.vocab_size
01/29/2024 20:56:19 - WARNING -   Using patch shift!
01/29/2024 20:56:19 - WARNING -   Using patch shift!
01/29/2024 20:56:19 - INFO -   --------------------
01/29/2024 20:56:19 - INFO -   Weights of UCoFiA not initialized from pretrained model:
   global_mat_weight
   global_mat_weight_1
   word_logit_weight
   frame_logit_weight
   local_mat_weight
   local_mat_weight1
   frame_mat_weight
   word_mat_weight
   frame_mat_weight2
   word_mat_weight2
   pixel_mat_weight
   pixel_mat_weight2
   word_mat_weight_for_pixel
   visual_token_selector.score_predictor.in_conv.0.weight
   visual_token_selector.score_predictor.in_conv.0.bias
   visual_token_selector.score_predictor.in_conv.1.weight
   visual_token_selector.score_predictor.out_conv.0.weight
   visual_token_selector.score_predictor.out_conv.2.weight
01/29/2024 20:56:19 - INFO -   Weights from pretrained model not used in UCoFiA:
   clip.input_resolution
   clip.context_length
   clip.vocab_size
01/29/2024 20:56:19 - WARNING -   Using patch shift!
01/29/2024 20:56:19 - WARNING -   Using patch shift!
/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torchvision/transforms/transforms.py:332: UserWarning: Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. Please use InterpolationMode enum.
warnings.warn(
01/29/2024 20:56:19 - INFO -   ***** Running test *****
01/29/2024 20:56:19 - INFO -     Num examples = 1000
01/29/2024 20:56:19 - INFO -     Batch size = 32
01/29/2024 20:56:19 - INFO -     Num steps = 32
01/29/2024 20:56:19 - INFO -   ***** Running val *****
01/29/2024 20:56:19 - INFO -     Num examples = 1000
/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torchvision/transforms/transforms.py:332: UserWarning: Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. Please use InterpolationMode enum.
warnings.warn(
01/29/2024 20:56:19 - INFO -   ***** Running test *****
01/29/2024 20:56:19 - INFO -     Num examples = 1000
01/29/2024 20:56:19 - INFO -     Batch size = 32
01/29/2024 20:56:19 - INFO -     Num steps = 32
01/29/2024 20:56:19 - INFO -   ***** Running val *****
01/29/2024 20:56:19 - INFO -     Num examples = 1000
/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torchvision/transforms/transforms.py:332: UserWarning: Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. Please use InterpolationMode enum.
warnings.warn(
01/29/2024 20:56:19 - INFO -   ***** Running test *****
01/29/2024 20:56:19 - INFO -     Num examples = 1000
01/29/2024 20:56:19 - INFO -     Batch size = 32
01/29/2024 20:56:19 - INFO -     Num steps = 32
01/29/2024 20:56:19 - INFO -   ***** Running val *****
01/29/2024 20:56:19 - INFO -     Num examples = 1000
/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torchvision/transforms/transforms.py:332: UserWarning: Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. Please use InterpolationMode enum.
warnings.warn(
01/29/2024 20:56:19 - INFO -   ***** Running test *****
01/29/2024 20:56:19 - INFO -     Num examples = 1000
01/29/2024 20:56:19 - INFO -     Batch size = 32
01/29/2024 20:56:19 - INFO -     Num steps = 32
01/29/2024 20:56:19 - INFO -   ***** Running val *****
01/29/2024 20:56:19 - INFO -     Num examples = 1000
/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torchvision/transforms/transforms.py:332: UserWarning: Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. Please use InterpolationMode enum.
warnings.warn(
/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torchvision/transforms/transforms.py:332: UserWarning: Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. Please use InterpolationMode enum.
warnings.warn(
/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torchvision/transforms/transforms.py:332: UserWarning: Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. Please use InterpolationMode enum.
warnings.warn(

dell-Precision-7820-Tower:3599208:3599208 [0] enqueue.cc:102 NCCL WARN Cuda failure 'invalid device function'

dell-Precision-7820-Tower:3599208:3599208 [0] bootstrap.cc:40 NCCL WARN Bootstrap : no socket interface found
dell-Precision-7820-Tower:3599208:3599208 [0] NCCL INFO init.cc:98 -> 3
dell-Precision-7820-Tower:3599208:3599208 [0] NCCL INFO init.cc:150 -> 3
dell-Precision-7820-Tower:3599208:3599208 [0] NCCL INFO init.cc:167 -> 3
Traceback (most recent call last):
File "main_ucofia.py", line 554, in <module>
    main()
File "main_ucofia.py", line 504, in main
    optimizer, scheduler, model = prep_optimizer(args, model, num_train_optimization_steps, device, n_gpu, args.local_rank, coef_lr=coef_lr)
File "main_ucofia.py", line 214, in prep_optimizer
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank],
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 646, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1123, internal error, NCCL version 2.10.3
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption
/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torchvision/transforms/transforms.py:332: UserWarning: Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. Please use InterpolationMode enum.
warnings.warn(
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3599209 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3599210 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3599211 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3599208) of binary: /home/dell/anaconda3/envs/ucofia/bin/python
Traceback (most recent call last):
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/distributed/run.py", line 765, in <module>
    main()
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main_ucofia.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time      : 2024-01-29_20:56:38
host      : dell-Precision-7820-Tower
rank      : 0 (local_rank: 0)
exitcode : 1 (pid: 3599208)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

This error went away once I was back at school, so it seems to have been a network problem.
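Besides the network-related `Bootstrap : no socket interface found` line, the log above also warns that the RTX 3090 (sm_86) is not covered by the installed PyTorch build (sm_37 sm_50 sm_60 sm_70), and NCCL reports `Cuda failure 'invalid device function'`. It may be worth ruling that out as well. The following is a minimal sketch, not part of UCoFiA: the helper names `parse_sm` and `build_has_kernels_for` are hypothetical, and the check is a simplification that only looks at precompiled `sm_XX` kernels and ignores PTX (`compute_XX`) entries.

```python
def parse_sm(arch):
    """Turn 'sm_86' into 86; return None for other entries such as 'compute_70'."""
    prefix, _, digits = arch.partition("_")
    return int(digits) if prefix == "sm" and digits.isdigit() else None


def build_has_kernels_for(capability, arch_list):
    """Simplified check: the build ships a binary kernel with the same major
    version as the device and a minor version not above the device's."""
    major, minor = capability
    for arch in arch_list:
        sm = parse_sm(arch)
        if sm is None:
            continue  # skip PTX or unrecognised entries (assumption: binaries only)
        if sm // 10 == major and sm % 10 <= minor:
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)   # e.g. (8, 6) for an RTX 3090
        archs = torch.cuda.get_arch_list()          # e.g. ['sm_37', 'sm_50', ...]
        print("device capability:", cap)
        print("build arch list:", archs)
        print("kernels available:", build_has_kernels_for(cap, archs))
```

With the values printed in the warning, `build_has_kernels_for((8, 6), ["sm_37", "sm_50", "sm_60", "sm_70"])` evaluates to `False`, which matches what the warning reports for this environment.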

(ucofia) dell@dell-Precision-7820-Tower:/media/dell/disak1/imuse_videoUnderstanding/litianqi/UCoFiA-main/train$ sh scripts/train_msrvtt.sh
Traceback (most recent call last):
File "main_ucofia.py", line 21, in <module>
    torch.distributed.init_process_group(backend="nccl")
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 602, in init_process_group
    default_pg = _new_process_group_helper(
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 727, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3608964) of binary: /home/dell/anaconda3/envs/ucofia/bin/python
Traceback (most recent call last):
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/distributed/run.py", line 765, in <module>
    main()
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
File "/home/dell/anaconda3/envs/ucofia/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main_ucofia.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time      : 2024-01-30_20:20:23
host      : dell-Precision-7820-Tower
rank      : 0 (local_rank: 0)
exitcode : 1 (pid: 3608964)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
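The traceback shows `init_process_group(backend="nccl")` at line 21 of `main_ucofia.py` failing with `Distributed package doesn't have NCCL built in`, which PyTorch raises when the installed build reports no NCCL support (for example a CPU-only build). A minimal sketch for inspecting the environment before launching, assuming nothing about the UCoFiA code itself; `pick_backend` is a hypothetical helper, and falling back to `gloo` is only a debugging aid, not what the training script does:

```python
def pick_backend(nccl_available, cuda_available):
    """Return 'nccl' only when both NCCL and a CUDA device are usable, else 'gloo'."""
    return "nccl" if (nccl_available and cuda_available) else "gloo"


if __name__ == "__main__":
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        print("torch is not installed in this environment")
    else:
        # is_nccl_available() is only consulted when the distributed package exists
        nccl_ok = dist.is_available() and dist.is_nccl_available()
        print("torch build:", torch.__version__, "CUDA:", torch.version.cuda)
        print("NCCL built in:", nccl_ok)
        print("suggested backend:", pick_backend(nccl_ok, torch.cuda.is_available()))
```

If `NCCL built in` prints `False`, or `torch.version.cuda` prints `None`, the environment probably holds a different PyTorch build than the one used for the earlier run, and reinstalling a CUDA-enabled build is likely a more useful next step than changing the script.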