Kolors (可图) LoRA Style Story Challenge, Tianchi Innovative Application Competition

Opening an Alibaba Cloud PAI-DSW trial is already explained in detail in the official tutorial.

Besides running on a PAI instance, this post also tries to do as much as possible locally.

git lfs install
git clone https://www.modelscope.cn/datasets/maochase/kolors.git

Inside the cloned folder, besides the baseline there are also Data-Juicer and DiffSynth-Studio:

Data-Juicer: a data processing and transformation tool that simplifies extracting, transforming, and loading data
DiffSynth-Studio: a toolkit for efficiently fine-tuning large models
!pip install simple-aesthetics-predictor
!pip install -v -e data-juicer
!pip uninstall pytorch-lightning -y
!pip install peft lightning pandas torchvision
!pip install -e DiffSynth-Studio

Downloading the dataset is straightforward, so let's go straight to the LoRA fine-tuning. Some of the parameters can be adjusted to your needs; for example, lora_rank controls the size of the LoRA adapter:

# Download the base models
from diffsynth import download_models
download_models(["Kolors", "SDXL-vae-fp16-fix"])

# Model training
import os

cmd = """
python DiffSynth-Studio/examples/train/kolors/train_kolors_lora.py \
  --pretrained_unet_path models/kolors/Kolors/unet/diffusion_pytorch_model.safetensors \
  --pretrained_text_encoder_path models/kolors/Kolors/text_encoder \
  --pretrained_fp16_vae_path models/sdxl-vae-fp16-fix/diffusion_pytorch_model.safetensors \
  --lora_rank 16 \
  --lora_alpha 4.0 \
  --dataset_path data/lora_dataset_processed \
  --output_path ./models \
  --max_epochs 1 \
  --center_crop \
  --use_gradient_checkpointing \
  --precision "16-mixed"
""".strip()

os.system(cmd)
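A quick sanity check on what lora_rank means for adapter size: LoRA freezes each target linear layer W (shape d_out × d_in) and trains a low-rank update BA, with B of shape (d_out, r) and A of shape (r, d_in), so every adapted layer adds r * (d_in + d_out) trainable parameters. A minimal sketch (the layer shape below is illustrative, not read from the model):

def lora_params(d_in: int, d_out: int, r: int) -> int:
    # B is (d_out, r) and A is (r, d_in), so LoRA adds r * (d_in + d_out) weights.
    return r * (d_in + d_out)

# Illustrative: a 2048 -> 1280 cross-attention projection at the rank used above.
print(lora_params(2048, 1280, r=16))  # 53248
print(lora_params(2048, 1280, r=64))  # 212992: 4x the rank, 4x the parameters

Summed over all the to_q/to_k/to_v/to_out projections in the UNet, this is roughly where the 23.2 M trainable parameters reported in the log below come from.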

What follows is just a record of the training output (the trailing 0 at the end is the return value of os.system):

Loading models from: models/kolors/Kolors/unet/diffusion_pytorch_model.safetensors
    model_name: sdxl_unet model_class: SDXLUNet
        This model is initialized with extra kwargs: {'is_kolors': True}
    The following models are loaded: ['sdxl_unet'].
Loading models from: models/kolors/Kolors/text_encoder
Loading checkpoint shards: 100%|██████████| 7/7 [00:06<00:00,  1.02it/s]
/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
Using 16bit Automatic Mixed Precision (AMP)
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/plugins/precision/amp.py:52: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA A100-SXM4-80GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Missing logger folder: models/lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name | Type              | Params | Mode
--------------------------------------------------
0 | pipe | SDXLImagePipeline | 8.9 B  | eval
--------------------------------------------------
23.2 M    Trainable params
8.9 B     Non-trainable params
8.9 B     Total params
35,719.684 Total estimated model params size (MB)
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.
    The following models are loaded: ['kolors_text_encoder'].
Loading models from: models/sdxl-vae-fp16-fix/diffusion_pytorch_model.safetensors
    model_name: sdxl_vae_encoder model_class: SDXLVAEEncoder
        This model is initialized with extra kwargs: {'upcast_to_float32': True}
    model_name: sdxl_vae_decoder model_class: SDXLVAEDecoder
        This model is initialized with extra kwargs: {'upcast_to_float32': True}
    The following models are loaded: ['sdxl_vae_encoder', 'sdxl_vae_decoder'].
No sdxl_text_encoder models available.
No sdxl_text_encoder_2 models available.
Using kolors_text_encoder from models/kolors/Kolors/text_encoder.
Using sdxl_unet from models/kolors/Kolors/unet/diffusion_pytorch_model.safetensors.
Using sdxl_vae_decoder from models/sdxl-vae-fp16-fix/diffusion_pytorch_model.safetensors.
Using sdxl_vae_encoder from models/sdxl-vae-fp16-fix/diffusion_pytorch_model.safetensors.
No sdxl_ipadapter models available.
No sdxl_ipadapter_clip_image_encoder models available.
Switch to Kolors. The prompter and scheduler will be replaced.
Epoch 0: 100%|██████████| 500/500 [08:06<00:00,  1.03it/s, v_num=0, train_loss=0.314]   
/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 0: 100%|██████████| 500/500 [08:07<00:00,  1.03it/s, v_num=0, train_loss=0.314]
0

At this point the run occupies 23.80 GB of RAM and 20 GB of VRAM.
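If you want the VRAM number from the process itself instead of eyeballing nvidia-smi, a minimal sketch (PyTorch's counter excludes the CUDA context and cached-but-unused blocks, so nvidia-smi will report somewhat more):

import torch

# Peak GPU memory held by tensors since startup (or the last reset), in GB.
print(f"{torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")

# Reset the counter so the next stage can be measured in isolation.
torch.cuda.reset_peak_memory_stats()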

Next, inspect the training script's input arguments:

!python3 DiffSynth-Studio/examples/train/kolors/train_kolors_lora.py -h
usage: train_kolors_lora.py [-h] --pretrained_unet_path PRETRAINED_UNET_PATH
                            --pretrained_text_encoder_path
                            PRETRAINED_TEXT_ENCODER_PATH
                            --pretrained_fp16_vae_path
                            PRETRAINED_FP16_VAE_PATH
                            [--lora_target_modules LORA_TARGET_MODULES]
                            --dataset_path DATASET_PATH
                            [--output_path OUTPUT_PATH]
                            [--steps_per_epoch STEPS_PER_EPOCH]
                            [--height HEIGHT] [--width WIDTH] [--center_crop]
                            [--random_flip] [--batch_size BATCH_SIZE]
                            [--dataloader_num_workers DATALOADER_NUM_WORKERS]
                            [--precision {32,16,16-mixed}]
                            [--learning_rate LEARNING_RATE]
                            [--lora_rank LORA_RANK] [--lora_alpha LORA_ALPHA]
                            [--use_gradient_checkpointing]
                            [--accumulate_grad_batches ACCUMULATE_GRAD_BATCHES]
                            [--training_strategy {auto,deepspeed_stage_1,deepspeed_stage_2,deepspeed_stage_3}]
                            [--max_epochs MAX_EPOCHS]
                            [--modelscope_model_id MODELSCOPE_MODEL_ID]
                            [--modelscope_access_token MODELSCOPE_ACCESS_TOKEN]

Simple example of a training script.

optional arguments:
  -h, --help            show this help message and exit
  --pretrained_unet_path PRETRAINED_UNET_PATH
                        Path to pretrained model (UNet). For example, `models/
                        kolors/Kolors/unet/diffusion_pytorch_model.safetensors
                        `.
  --pretrained_text_encoder_path PRETRAINED_TEXT_ENCODER_PATH
                        Path to pretrained model (Text Encoder). For example,
                        `models/kolors/Kolors/text_encoder`.
  --pretrained_fp16_vae_path PRETRAINED_FP16_VAE_PATH
                        Path to pretrained model (VAE). For example,
                        `models/kolors/Kolors/sdxl-vae-
                        fp16-fix/diffusion_pytorch_model.safetensors`.
  --lora_target_modules LORA_TARGET_MODULES
                        Layers with LoRA modules.
  --dataset_path DATASET_PATH
                        The path of the Dataset.
  --output_path OUTPUT_PATH
                        Path to save the model.
  --steps_per_epoch STEPS_PER_EPOCH
                        Number of steps per epoch.
  --height HEIGHT       Image height.
  --width WIDTH         Image width.
  --center_crop         Whether to center crop the input images to the
                        resolution. If not set, the images will be randomly
                        cropped. The images will be resized to the resolution
                        first before cropping.
  --random_flip         Whether to randomly flip images horizontally
  --batch_size BATCH_SIZE
                        Batch size (per device) for the training dataloader.
  --dataloader_num_workers DATALOADER_NUM_WORKERS
                        Number of subprocesses to use for data loading. 0
                        means that the data will be loaded in the main
                        process.
  --precision {32,16,16-mixed}
                        Training precision
  --learning_rate LEARNING_RATE
                        Learning rate.
  --lora_rank LORA_RANK
                        The dimension of the LoRA update matrices.
  --lora_alpha LORA_ALPHA
                        The weight of the LoRA update matrices.
  --use_gradient_checkpointing
                        Whether to use gradient checkpointing.
  --accumulate_grad_batches ACCUMULATE_GRAD_BATCHES
                        The number of batches in gradient accumulation.
  --training_strategy {auto,deepspeed_stage_1,deepspeed_stage_2,deepspeed_stage_3}
                        Training strategy
  --max_epochs MAX_EPOCHS
                        Number of epochs.
  --modelscope_model_id MODELSCOPE_MODEL_ID
                        Model ID on ModelScope (https://www.modelscope.cn/).
                        The model will be uploaded to ModelScope automatically
                        if you provide a Model ID.
  --modelscope_access_token MODELSCOPE_ACCESS_TOKEN
                        Access key on ModelScope (https://www.modelscope.cn/).
                        Required if you want to upload the model to
                        ModelScope.
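These flags map one-to-one onto the run above. Purely as an illustration (the values below are untested guesses, not tuned settings), a variant that trains two epochs at 768x768 with horizontal flips and an explicit learning rate would look like:

import os

# Illustrative values only; every flag comes from the -h output above.
cmd = """
python DiffSynth-Studio/examples/train/kolors/train_kolors_lora.py \
  --pretrained_unet_path models/kolors/Kolors/unet/diffusion_pytorch_model.safetensors \
  --pretrained_text_encoder_path models/kolors/Kolors/text_encoder \
  --pretrained_fp16_vae_path models/sdxl-vae-fp16-fix/diffusion_pytorch_model.safetensors \
  --lora_rank 32 \
  --lora_alpha 8.0 \
  --dataset_path data/lora_dataset_processed \
  --output_path ./models \
  --height 768 --width 768 \
  --random_flip \
  --learning_rate 1e-4 \
  --max_epochs 2 \
  --use_gradient_checkpointing \
  --precision "16-mixed"
""".strip()

os.system(cmd)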

Load the models and generate images:

from diffsynth import ModelManager, SDXLImagePipeline
from peft import LoraConfig, inject_adapter_in_model
import torch


def load_lora(model, lora_rank, lora_alpha, lora_path):
    lora_config = LoraConfig(
        r=lora_rank,
        lora_alpha=lora_alpha,
        init_lora_weights="gaussian",
        target_modules=["to_q", "to_k", "to_v", "to_out"],
    )
    model = inject_adapter_in_model(lora_config, model)
    state_dict = torch.load(lora_path, map_location="cpu")
    model.load_state_dict(state_dict, strict=False)
    return model


# Load models
model_manager = ModelManager(torch_dtype=torch.float16, device="cuda",
                             file_path_list=[
                                 "models/kolors/Kolors/text_encoder",
                                 "models/kolors/Kolors/unet/diffusion_pytorch_model.safetensors",
                                 "models/kolors/Kolors/vae/diffusion_pytorch_model.safetensors"
                             ])
pipe = SDXLImagePipeline.from_model_manager(model_manager)

# Load LoRA
pipe.unet = load_lora(
    pipe.unet,
    lora_rank=16, # Must match the lora_rank used in the training script above.
    lora_alpha=2.0, # Effective LoRA scale is lora_alpha / lora_rank, so 2.0 here applies the update at half the strength of the 4.0 used in training.
    lora_path="models/lightning_logs/version_0/checkpoints/epoch=0-step=500.ckpt"
)
Loading models from: models/kolors/Kolors/text_encoder
Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
    The following models are loaded: ['kolors_text_encoder'].
Loading models from: models/kolors/Kolors/unet/diffusion_pytorch_model.safetensors
    model_name: sdxl_unet model_class: SDXLUNet
        This model is initialized with extra kwargs: {'is_kolors': True}
    The following models are loaded: ['sdxl_unet'].
Loading models from: models/kolors/Kolors/vae/diffusion_pytorch_model.safetensors
    model_name: sdxl_vae_encoder model_class: SDXLVAEEncoder
        This model is initialized with extra kwargs: {'upcast_to_float32': True}
    model_name: sdxl_vae_decoder model_class: SDXLVAEDecoder
        This model is initialized with extra kwargs: {'upcast_to_float32': True}
    The following models are loaded: ['sdxl_vae_encoder', 'sdxl_vae_decoder'].
/root/.conda/envs/kolor/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
No sdxl_text_encoder models available.
No sdxl_text_encoder_2 models available.
Using kolors_text_encoder from models/kolors/Kolors/text_encoder.
Using sdxl_unet from models/kolors/Kolors/unet/diffusion_pytorch_model.safetensors.
Using sdxl_vae_decoder from models/kolors/Kolors/vae/diffusion_pytorch_model.safetensors.
Using sdxl_vae_encoder from models/kolors/Kolors/vae/diffusion_pytorch_model.safetensors.
No sdxl_ipadapter models available.
No sdxl_ipadapter_clip_image_encoder models available.
Switch to Kolors. The prompter and scheduler will be replaced.
/tmp/ipykernel_124608/1083844865.py:14: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state_dict = torch.load(lora_path, map_location="cpu")
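Since the checkpoint saved by the training script holds only the LoRA tensors, the torch.load warning above can presumably be silenced by opting in to the future default now; inside load_lora the call would become (this assumes the file really contains nothing but tensors):

state_dict = torch.load(lora_path, map_location="cpu", weights_only=True)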

Now generate an image:

torch.manual_seed(0)
image = pipe(
    prompt="二次元,一个紫色短发小女孩,在家中沙发上坐着,双手托着腮,很无聊,全身,粉色连衣裙",
    negative_prompt="丑陋、变形、嘈杂、模糊、低对比度",
    cfg_scale=4,
    num_inference_steps=50, height=1024, width=1024,
)
image.save("1.jpg")

At this point the run occupies 35.15 GB of RAM and 25 GB of VRAM.

[Image: the generated image saved as 1.jpg]

All of the generated images together:

[Image: the full set of generated images]
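The remaining frames were produced the same way: one pipe(...) call per frame, each with its own seed and prompt. A minimal sketch (these prompts are illustrative stand-ins, not the exact ones used for the images above):

prompts = [
    "二次元,一个紫色短发小女孩,在教室里上课,认真听讲,全身,粉色连衣裙",
    "二次元,一个紫色短发小女孩,在操场上奔跑,笑得很开心,全身,粉色连衣裙",
    # ... one prompt per remaining frame
]
for i, prompt in enumerate(prompts, start=2):
    torch.manual_seed(i)  # a fresh seed per frame keeps each result reproducible
    image = pipe(
        prompt=prompt,
        negative_prompt="丑陋、变形、嘈杂、模糊、低对比度",
        cfg_scale=4,
        num_inference_steps=50, height=1024, width=1024,
    )
    image.save(f"{i}.jpg")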


Note that when using data-juicer to process the data, I drive it from the terminal; with the settings in data_juicer_config.yaml, the filtered result ends up in ./data/data-juicer/output/result.jsonl.

from modelscope.msdatasets import MsDataset

ds = MsDataset.load(
    'AI-ModelScope/lowres_anime',
    subset_name='default',
    split='train',
    cache_dir="/root/k2/AIGC/kolors/data"
)

########## Process the data with data-juicer
data_juicer_config = """
# global parameters
project_name: 'data-process'
dataset_path: './data/data-juicer/input/metadata.jsonl'  # path to your dataset directory or file
np: 4  # number of subprocess to process your dataset

text_keys: 'text'
image_key: 'image'
image_special_token: '<__dj__image>'

export_path: './data/data-juicer/output/result.jsonl'

# process schedule
# a list of several process operators with their arguments
process:
    - image_shape_filter:
        min_width: 1024
        min_height: 1024
        any_or_all: any
    - image_aspect_ratio_filter:
        min_ratio: 0.5
        max_ratio: 2.0
        any_or_all: any
"""
with open("data/data-juicer/data_juicer_config.yaml", "w") as file:
    file.write(data_juicer_config.strip())
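One thing the snippet above glosses over: dataset_path in the config points at ./data/data-juicer/input/metadata.jsonl, which still has to be built from ds. In the baseline this is done by saving every image to disk and writing one JSON line per sample; a minimal sketch along those lines (the fixed '二次元' caption follows the baseline, and the keys match text_keys/image_key in the config):

import json, os
from tqdm import tqdm

os.makedirs("./data/lora_dataset/train", exist_ok=True)
os.makedirs("./data/data-juicer/input", exist_ok=True)

with open("./data/data-juicer/input/metadata.jsonl", "w") as f:
    for data_id, data in enumerate(tqdm(ds)):
        # Save the raw image, then record its path plus a caption as one JSON line.
        image = data["image"].convert("RGB")
        image_path = os.path.abspath(f"./data/lora_dataset/train/{data_id}.jpg")
        image.save(image_path)
        f.write(json.dumps({"text": "二次元", "image": [image_path]}, ensure_ascii=False) + "\n")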

Run it in the terminal:

dj-process --config data/data-juicer/data_juicer_config.yaml

[Screenshot: dj-process running in the terminal]

2024-08-09 01:03:47 | INFO     | data_juicer.config.config:618 - Back up the input config file [/root/k2/AIGC/kolors/data/data-juicer/data_juicer_config.yaml] into the work_dir [/root/k2/AIGC/kolors/data/data-juicer/output]
2024-08-09 01:03:47 | INFO     | data_juicer.config.config:640 - Configuration table: 
╒═════════════════════════╤═══════════════════════════════════════════════════════════════════════════════╕
│ key                     │ values                                                                        │
╞═════════════════════════╪═══════════════════════════════════════════════════════════════════════════════╡
│ config                  │ [Path_fr(data/data-juicer/data_juicer_config.yaml, cwd=/root/k2/AIGC/kolors)] │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ hpo_config              │ None                                                                          │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ data_probe_algo         │ 'uniform'                                                                     │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ data_probe_ratio        │ 1.0                                                                           │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ project_name            │ 'data-process'                                                                │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ executor_type           │ 'default'                                                                     │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ dataset_path            │ '/root/k2/AIGC/kolors/data/data-juicer/input/metadata.jsonl'                  │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ export_path             │ '/root/k2/AIGC/kolors/data/data-juicer/output/result.jsonl'                   │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ export_shard_size       │ 0                                                                             │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ export_in_parallel      │ False                                                                         │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ keep_stats_in_res_ds    │ False                                                                         │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ keep_hashes_in_res_ds   │ False                                                                         │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ np                      │ 4                                                                             │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ text_keys               │ 'text'                                                                        │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ image_key               │ 'image'                                                                       │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ image_special_token     │ '<__dj__image>'                                                               │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ audio_key               │ 'audios'                                                                      │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ audio_special_token     │ '<__dj__audio>'                                                               │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ video_key               │ 'videos'                                                                      │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ video_special_token     │ '<__dj__video>'                                                               │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ eoc_special_token       │ '<|__dj__eoc|>'                                                               │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ suffixes                │ []                                                                            │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ use_cache               │ True                                                                          │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ ds_cache_dir            │ '/root/.cache/huggingface/datasets'                                           │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ cache_compress          │ None                                                                          │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ use_checkpoint          │ False                                                                         │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ temp_dir                │ None                                                                          │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ open_tracer             │ False                                                                         │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ op_list_to_trace        │ []                                                                            │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ trace_num               │ 10                                                                            │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ op_fusion               │ False                                                                         │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ process                 │ [{'image_shape_filter': {'accelerator': None,                                 │
│                         │                          'any_or_all': 'any',                                 │
│                         │                          'audio_key': 'audios',                               │
│                         │                          'cpu_required': 1,                                   │
│                         │                          'image_key': 'image',                                │
│                         │                          'max_height': 9223372036854775807,                   │
│                         │                          'max_width': 9223372036854775807,                    │
│                         │                          'mem_required': 0,                                   │
│                         │                          'min_height': 1024,                                  │
│                         │                          'min_width': 1024,                                   │
│                         │                          'num_proc': 4,                                       │
│                         │                          'stats_export_path': None,                           │
│                         │                          'text_key': 'text',                                  │
│                         │                          'video_key': 'videos'}},                             │
│                         │  {'image_aspect_ratio_filter': {'accelerator': None,                          │
│                         │                                 'any_or_all': 'any',                          │
│                         │                                 'audio_key': 'audios',                        │
│                         │                                 'cpu_required': 1,                            │
│                         │                                 'image_key': 'image',                         │
│                         │                                 'max_ratio': 2.0,                             │
│                         │                                 'mem_required': 0,                            │
│                         │                                 'min_ratio': 0.5,                             │
│                         │                                 'num_proc': 4,                                │
│                         │                                 'stats_export_path': None,                    │
│                         │                                 'text_key': 'text',                           │
│                         │                                 'video_key': 'videos'}}]                      │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ percentiles             │ []                                                                            │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ export_original_dataset │ False                                                                         │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ save_stats_in_one_file  │ False                                                                         │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ ray_address             │ 'auto'                                                                        │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ debug                   │ False                                                                         │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ work_dir                │ '/root/k2/AIGC/kolors/data/data-juicer/output'                                │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ timestamp               │ '20240809010346'                                                              │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ dataset_dir             │ '/root/k2/AIGC/kolors/data/data-juicer/input'                                 │
├─────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ add_suffix              │ False                                                                         │
╘═════════════════════════╧═══════════════════════════════════════════════════════════════════════════════╛
2024-08-09 01:03:47 | INFO     | data_juicer.core.executor:47 - Using cache compression method: [None]
2024-08-09 01:03:47 | INFO     | data_juicer.core.executor:52 - Setting up data formatter...
2024-08-09 01:03:47 | INFO     | data_juicer.core.executor:74 - Preparing exporter...
2024-08-09 01:03:47 | INFO     | data_juicer.core.executor:151 - Loading dataset from data formatter...
Setting num_proc from 4 back to 1 for the jsonl split to disable multiprocessing as it only contains one shard.
Generating jsonl split: 1454 examples [00:00, 20076.04 examples/s]
2024-08-09 01:03:48 | INFO     | data_juicer.format.formatter:185 - Unifying the input dataset formats...
2024-08-09 01:03:48 | INFO     | data_juicer.format.formatter:200 - There are 1454 sample(s) in the original dataset.
Filter (num_proc=4): 100%|#############################################################################################################| 1454/1454 [00:00<00:00, 4993.64 examples/s]
2024-08-09 01:03:49 | INFO     | data_juicer.format.formatter:214 - 1454 samples left after filtering empty text.
2024-08-09 01:03:49 | INFO     | data_juicer.format.formatter:237 - Converting relative paths in the dataset to their absolute version. (Based on the directory of input dataset file)
Map (num_proc=4): 100%|################################################################################################################| 1454/1454 [00:00<00:00, 8079.14 examples/s]
2024-08-09 01:03:49 | INFO     | data_juicer.format.mixture_formatter:137 - sampled 1454 from 1454
2024-08-09 01:03:49 | INFO     | data_juicer.format.mixture_formatter:143 - There are 1454 in final dataset
2024-08-09 01:03:49 | INFO     | data_juicer.core.executor:157 - Preparing process operators...
2024-08-09 01:03:49 | INFO     | data_juicer.core.executor:164 - Processing data...
Adding new column for stats (num_proc=4): 100%|########################################################################################| 1454/1454 [00:00<00:00, 8337.08 examples/s]
image_shape_filter_compute_stats (num_proc=4): 100%|####################################################################################| 1454/1454 [00:08<00:00, 162.91 examples/s]
image_shape_filter_process (num_proc=4): 100%|#########################################################################################| 1454/1454 [00:00<00:00, 8087.59 examples/s]
2024-08-09 01:03:59 | INFO     | data_juicer.core.data:193 - OP [image_shape_filter] Done in 9.538s. Left 129 samples.
image_aspect_ratio_filter_compute_stats (num_proc=4): 100%|###############################################################################| 129/129 [00:01<00:00, 123.66 examples/s]
image_aspect_ratio_filter_process (num_proc=4): 100%|#####################################################################################| 129/129 [00:00<00:00, 799.52 examples/s]
2024-08-09 01:04:00 | INFO     | data_juicer.core.data:193 - OP [image_aspect_ratio_filter] Done in 1.372s. Left 129 samples.
2024-08-09 01:04:00 | INFO     | data_juicer.core.executor:171 - All OPs are done in 10.922s.
2024-08-09 01:04:00 | INFO     | data_juicer.core.executor:174 - Exporting dataset to disk...
2024-08-09 01:04:00 | INFO     | data_juicer.core.exporter:111 - Exporting computed stats into a single file...
Creating json from Arrow format: 100%|################################################################################################################| 1/1 [00:00<00:00, 10.24ba/s]
2024-08-09 01:04:00 | INFO     | data_juicer.core.exporter:140 - Export dataset into a single file...
Creating json from Arrow format: 100%|################################################################################################################| 1/1 [00:00<00:00, 18.43ba/s]
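Finally, the training run earlier pointed --dataset_path at data/lora_dataset_processed, which is derived from the 129 samples that survived in result.jsonl. A sketch of that conversion, following the baseline (resize everything to 1024x1024 and write a metadata.csv with file_name and text columns for the trainer to read):

import json, os
import pandas as pd
from PIL import Image
from tqdm import tqdm

os.makedirs("./data/lora_dataset_processed/train", exist_ok=True)

texts, file_names = [], []
with open("./data/data-juicer/output/result.jsonl", "r") as f:
    for data_id, line in enumerate(tqdm(f)):
        data = json.loads(line)
        # Each surviving line keeps the caption and the path(s) of an image
        # that passed both filters above.
        image = Image.open(data["image"][0]).resize((1024, 1024))
        image.save(f"./data/lora_dataset_processed/train/{data_id}.jpg")
        texts.append(data["text"])
        file_names.append(f"{data_id}.jpg")

pd.DataFrame({"file_name": file_names, "text": texts}).to_csv(
    "./data/lora_dataset_processed/train/metadata.csv", index=False
)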