Image classification: a walkthrough of the pytorch-image-models-master code directory
References
https://fastai.github.io/timmdocs/training_modelEMA
https://blog.csdn.net/weixin_44396553/article/details/120901765
https://github.com/rwightman/pytorch-image-models
https://fastai.github.io/timmdocs/training
broken pipe https://blog.csdn.net/qq_26369907/article/details/99701006
https://www.cnblogs.com/jiangkejie/p/14965003.html
Pytorch Image Models (a.k.a. timm) has a lot of pretrained models and interface which allows using these models as encoders in smp, however, not all models are supported.
transformer models do not have features_only functionality implemented
some models do not have appropriate strides
https://zhuanlan.zhihu.com/p/469323798
Glossary
Abbreviation | Full name | Explanation |
---|---|---|
AdvProp | Adversarial Propagation | AdvProp is an adversarial training scheme which treats adversarial examples as additional examples, to prevent overfitting. A method that learns from both clean images as well as adversarially modified images. Since clean images are derived from a different distribution as compared to adversarial images, the model needs to also use the batch statistics according to image source. If not, the model may not be effective in extracting accurate features. https://paperswithcode.com/paper/adversarial-examples-improve-image#code; https://resbyte.github.io/posts/2020/06/Adversarial-Prop/?msclkid=ca22832fb15511ecbd8c5c507c8acfab |
BatchNorm (BN) | Batch normalization | Batch normalization is a technique for training very deep neural networks that standardizes the inputs to a layer for each mini-batch. It can be implemented during training by calculating the mean and standard deviation of each input variable to a layer per mini-batch and using these statistics to perform the standardization. Alternatively, a running average of mean and standard deviation can be maintained across mini-batches, but this may result in unstable training. https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/?msclkid=40fc3fceb15711ecab625b99bab242c6 https://keras.io/api/layers/normalization_layers/batch_normalization/?msclkid=40fd1d9bb15711eca054521d10053fb0 |
AugMix | AugMix | AugMix is a data processing technique that mixes augmented images and enforces consistent embeddings of the augmented images, which results in increased robustness and improved uncertainty calibration. timm also supports AugMix with RandAugment and AutoAugment. https://github.com/google-research/augmix?msclkid=68e4d7b9b18d11eca899bc519f05d2e6 |
SyncBN | Synchronized Batch Normalization | SyncBN is a type of batch normalization used for multi-GPU training. Standard batch normalization only normalizes the data within each device (GPU). SyncBN normalizes the input within the whole mini-batch. |
SGDR | Stochastic Gradient Descent with Warm Restarts | The scheduler that timm also refers to as the cosine scheduler. |
File overview
The directory tree is as follows:
|— convert
| |— convert_from_mxnet
| |— convert_nest_flax
|— docs
| |— models
| | |— efficient.md
| | | …
| |— archived_changes.md
| |— changes.md
| |— index.md
| |— feature_extraction.md
| |— models.md
| |— results.md
| |— scripts.md
| |—training_hparam_examples.md
|— notebooks
|— results
|— tests
|— timm
| |— data
| | |— config.py : func_resolve_data_config()
| | |— dataset_factory.py
| | |— loader.py
| |— loss
| |— models
| |— optim
| | |— optim_factory.py
| |— scheduler
| | |— scheduler_factory.py
| |— utils
| | |— checkpoint_saver.py
| | |— metrics.py
|— avg_checkpoints.py
|— benchmark.py
|— clean_checkpoint.py
|— distributed_train.sh # Change the Python interpreter to Python 3.x in the scripts
|— train.py
Parameters
name | annotation | default or example | relation |
---|---|---|---|
data_dir | Path where the custom dataset is stored; used by create_dataset | e.g. 'd:/image' (containing the subdirectories 'd:/image/train' and 'd:/image/valid') | create_dataset |
dataset | Used by create_dataset; set to 'Afolder/Bfolder', which determines how the data is read or which data is read | 'torch/image_folder' | create_dataset |
train-split | Used by create_dataset; passed as the split argument to mark the dataset as the training set | 'train' | create_dataset |
val-split | Used by create_dataset; passed as the split argument to mark the dataset as the validation set | 'validation' | create_dataset |
class-map | Not needed when dataset = 'torch/…' or 'tfds/…' | Class-index mapping given as a text file or dict | create_dataset |
dataset_download | Used by create_dataset; must be enabled to download one of the public datasets available through PyTorch | False | create_dataset |
input_size | Image size; user input such as 3 224 224 is parsed into a list or tuple, e.g. [3, 224, 224] | None | create_loader |
img_size | Image size, int. In resolve_data_config (timm/data/config.py), input_size is taken first from the user-defined input_size, then from input_size = (3, args['img_size'], args['img_size']), and finally from the input_size in the model's default_cfg | None | create_loader |
batch_size | train batch size | 128 | create_loader |
validation-batch-size | Validation batch size | default=None | create_loader |
crop_pct | float, input image center crop percent (for validation only); resolve_data_config falls back to the value defined in the model config | default=None | create_loader |
mean | Override mean pixel value of dataset; used by create_loader in loader.py; resolve_data_config falls back to the value defined in the model config | default=None | create_loader |
std | Override std deviation of dataset; used by create_loader in loader.py; resolve_data_config falls back to the value defined in the model config | default=None | create_loader |
interpolation | Image resize interpolation type (overrides model); resolve_data_config falls back to the value defined in the model config | default='' | create_loader |
train-interpolation | Training interpolation (random, bilinear, bicubic; default: 'random'); args.interpolation is sometimes used instead | 'random' | create_loader |
aug-splits | Number of augmentation splits (default: 0, valid: 0 or >=2); assigned to num_aug_splits in train.py. For example, with a batch size of 8 and num_aug_splits=2, loader_train yields the 8 original images followed by 8 images that represent augmix1. With num_aug_splits=3 the effective batch size would be 24: the first 8 images are the originals, the next 8 represent augmix1 and the last 8 represent augmix2. | default=0 | loss augmentation |
local_rank | Mainly used to decide which process writes the log | default=0 | |
use-multi-epochs-loader | use the /timm/data/loader.py/MultiEpochsDataLoader to save time at the beginning of every epoch | default=False | |
log-wandb | log training and validation metrics to wandb | default=False | |
no-prefetcher | Disables the fast prefetcher when set; in train.py, prefetcher = not no_prefetcher | default=False | |
no_aug | when True, disable all training augmentation | False | create_loader |
reprob | Random erase probability; create_loader in loader.py uses reprob if is_training and not no_aug else 0. | default=0 | |
remode | Random erase mode; used by create_loader in loader.py | default='pixel' | |
recount | Random erase count; used by create_loader in loader.py | default=1 | |
resplit | Do not random erase first (clean) augmentation split; used by create_loader in loader.py | default=False | |
mixup | mixup alpha, mixup enabled if > 0 | default=0.0 | loss augmentation |
cutmix | cutmix alpha, cutmix enabled if > 0 | default=0.0 | loss augmentation |
cutmix-minmax | cutmix min/max ratio, overrides alpha and enables cutmix if set | default=None | loss augmentation |
smoothing | Label smoothing | default=0.1 | loss augmentation |
jsd-loss | Enable Jensen-Shannon Divergence + CE loss. Use with aug-splits | default=False | loss |
bce-loss | Enable BCE loss w/ Mixup/CutMix use | default=False | loss |
bce-target-thresh | Threshold for binarizing softened BCE targets (default: None, disabled) | default=None | loss |
mixup-off-epoch | Turn off mixup after this epoch, disabled if 0 | default=0 | train |
log_interval | How many batches to wait before logging training status. Must not be 0; it sets how often (in batches) a log line is written and the metric averages are computed | default=50 | |
save_images | save images of input batches every log interval for debugging | default=False | |
recovery_interval | how many batches to wait before writing recovery checkpoint | default=0 | |
amp | use NVIDIA Apex AMP or Native AMP for mixed precision training | default=False | mixed precision training |
apex-amp | Use NVIDIA Apex AMP mixed precision | default=False | mixed precision training |
native-amp | Use Native Torch AMP mixed precision | default=False | mixed precision training |
checkpoint-hist | number of checkpoints to keep | default=10 | saver |
eval-metric | Key of one of the metrics returned by the validation function validate. decreasing = True if eval_metric == 'loss' else False; the saver uses decreasing to choose the sort order, so the chosen metric must be present in the validation output | default='top1' | saver |
experiment | 1. Name of the project created when using wandb 2. Name of the sub-folder of the training output path | default='' | wandb |
model-ema | Enable tracking moving average of model weights, i.e. whether to use EMA | default=False | ema |
model-ema-force-cpu | Force ema to be tracked on CPU, rank=0 node only. Disables EMA validation. | default=False | ema |
model-ema-decay | decay factor for model weights moving average (default: 0.9998) | default=0.9998 | ema |
output | path to output folder (default: none, current dir). Related to the variable output_dir, which is output or './output/train', plus time + experiment + img.weight | default='' | |
channels-last | Use channels_last memory layout | default=False | model |
model | Name of model to train | default=‘resnet50’ | model |
pretrained | Start with pretrained version of specified network (if avail); corresponds to pretrained in create_model | default=False | model |
initial-checkpoint | Full file path. Initialize model from this checkpoint; corresponds to checkpoint_path in create_model | default='' | model |
num-classes | number of label classes Model must have num_classes attr if not set on cmd line/config. | default=None | model |
epochs | int, number of epochs to train | 300 | train |
epoch-repeats | float, epoch repeat multiplier (number of times to repeat dataset epoch per train epoch) | 0. | train |
start-epoch | int, manual epoch number (useful on restarts). If None, it is set to 0 or to the epoch stored in the resume checkpoint + 1 | default=None | resume training |
resume | Full file path. Resume full model and optimizer state from checkpoint via resume_checkpoint in helpers.py; partly similar to initial-checkpoint | default='' | resume training |
no-resume-opt | prevent resume of optimizer state when resuming model | False | resume training |
torchscript | convert model to TorchScript for inference | | model |
drop | Dropout rate | default=0.0 | model |
gp | Global pool type, one of (fast, avg, max, avgmax, avgmaxc). Model default if None. | default=None | model |
opt | Optimizer | default=‘sgd’ | opt |
opt-eps | float, Optimizer Epsilon | default=None | opt |
opt-betas | float, Optimizer Betas | default=None | opt |
lr | learning rate | default=0.05 | opt Learning rate schedule |
momentum | Optimizer momentum | default=0.9 | opt |
weight-decay | weight decay | default=2e-5 | opt |
clip-grad | Clip gradient norm (default: None, no clipping) | default=None | opt |
clip-mode | Gradient clipping mode. One of (“norm”, “value”, “agc”) | default=‘norm’ | opt |
sched | str, LR scheduler (learning-rate decay schedule) | 'cosine' | create_scheduler |
decay-rate | float, LR decay rate | 0.1 | create_scheduler |
warmup-lr | float, warmup learning rate; the LR rises from this value before it starts to decay | 0.0001 | create_scheduler |
min-lr | float, lower lr bound for cyclic schedulers that hit 0 (1e-5) | 1e-6 | create_scheduler |
decay-epochs | float, epoch interval to decay LR | 100 | create_scheduler |
warmup-epochs | int, epochs to warmup LR, if scheduler supports | 3 | create_scheduler |
cooldown-epochs | int, epochs to cooldown LR at min_lr, after cyclic schedule ends | 10 | create_scheduler |
patience-epochs | int, patience epochs for Plateau LR scheduler; the LR is lowered if the loss has not decreased for more than this many epochs | 10 | create_scheduler |
lr-noise | float, learning rate noise on/off epoch percentages | default=None | create_scheduler |
lr-noise-pct | float, learning rate noise limit percent | 0.67 | create_scheduler |
lr-noise-std | float, learning rate noise std-dev | 1.0 | create_scheduler |
lr-cycle-mul | float, learning rate cycle len multiplier | 1.0 | create_scheduler |
lr-cycle-decay | float, amount to decay each learning rate cycle | 0.5 | create_scheduler |
lr-cycle-limit | int, learning rate cycle limit, cycles enabled if > 1 | 1 | create_scheduler |
lr-k-decay | float, learning rate k-decay for cosine/poly | 1.0 | create_scheduler |
tta | Test/inference time augmentation (oversampling) factor. 0=None (default: 0) | default=0 | |
checkpoint_hist | type=int, number of checkpoints to keep: how many sets of model weights and training results remain stored after all epochs have run | default=10 | |
Variables you need to add yourself | | | |
distributed | Whether to run distributed training | default=False | distribute |
world_size | when distributed, = torch.distributed.get_world_size(), means total processes, 1 GPU per process. | default=1 | distribute |
rank | when distributed, = torch.distributed.get_rank() | default=1 | distribute |
hflip | Training set: transforms.RandomHorizontalFlip | 0.5 | create_loader |
vflip | Training set: transforms.RandomVerticalFlip | 0 | create_loader |
color_jitter | Training set: transforms.ColorJitter | 0.4 | create_loader |
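The aug-splits row describes how an augmented batch is laid out. A minimal sketch of that layout, assuming the splits are simply concatenated along the batch dimension as in the example given there; the helper name `split_ranges` is hypothetical and not part of timm:

```python
def split_ranges(batch_size, num_aug_splits):
    """Return (start, stop) index ranges of each split inside one concatenated batch.

    Hypothetical helper: split 0 holds the clean (original) images and the
    remaining splits hold the augmented copies (augmix1, augmix2, ...).
    """
    if num_aug_splits == 1 or num_aug_splits < 0:
        raise ValueError("valid values are 0 or >= 2")
    splits = max(num_aug_splits, 1)  # 0 means no splitting: one plain batch
    return [(i * batch_size, (i + 1) * batch_size) for i in range(splits)]
```

With batch_size=8 and num_aug_splits=3 the last range ends at index 24, which is the effective batch size mentioned in the table.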
1. convert
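This folder mainly contains the mxnet and flax converters, which convert pretrained models from mxnet and nest_flax to PyTorch models; converting pretrained models between deep learning frameworks is a common task.
2. docs
This folder mainly holds .md documentation files, covering each model's parameters, usage steps, requirements, code origin, and Top-1 and Top-5 accuracy on ImageNet. If you need a particular model, look it up in the docs folder.
2.1. index.md: getting started with timm
2.2. feature_extraction.md
2.3. models.md: the reference paper for each model
2.4. results.md: accuracy of each model on ImageNet
2.5. scripts.md and training_hparam_examples.md: examples of invoking the scripts with specific arguments
2.6. docs/models: per-model code examples showing how to load and use each model
3. notebooks
Code, notes, and Jupyter notebooks from the author's model reproductions
4. results
This folder mainly holds result files
5. tests
This folder contains test programs, covering layers, models, optimizers, and utilities. Readers can use them as needed to test their own models.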
6. timm
6.1. data
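Mainly covers image data configuration: data can be loaded from a directory path, a tar archive, and so on. It also implements image preprocessing such as auto-augmentation and transforms pipelines, which can help the trained network generalize better. Note that these transforms are image preprocessing operations and are unrelated to the self-attention-based Transformer architecture that originated in NLP.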
6.1.1. auto_augment.py
Mainly relies on PIL
6.1.2. config.py
resolve_data_config(args, default_cfg={}, model=None, use_test_size=False, verbose=False)
Builds new_config = {}, containing the input/image size (default (3, 224, 224)), interpolation method, mean and std deviation for normalization, and crop percentage (the center-crop ratio for validation images)
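The precedence that resolve_data_config applies to input_size (user-supplied input_size first, then img_size, then the model's default_cfg) can be sketched as follows. This is a simplified illustration under those assumptions, not the library code; `resolve_input_size` is a hypothetical name and `args` is assumed to be a plain dict:

```python
def resolve_input_size(args, default_cfg=None):
    """Pick input_size using the precedence described above (simplified sketch)."""
    default_cfg = default_cfg or {}
    if args.get("input_size") is not None:
        size = tuple(args["input_size"])
        assert len(size) == 3, "expected (channels, height, width)"
        return size
    if args.get("img_size") is not None:
        # A single int is expanded to a square 3-channel input.
        return (3, args["img_size"], args["img_size"])
    # Fall back to the model's default config, then to (3, 224, 224).
    return tuple(default_cfg.get("input_size", (3, 224, 224)))
```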
6.1.3. create_dataset in dataset_factory.py
Parameters
name,
root,
split='validation',
search_split=True,
class_map=None,
load_bytes=False,
is_training=False,
download=False,
batch_size=None,
repeats=0,
**kwargs
name is set to 'Afolder/Bfolder' and determines how the data is read or which data is read
- When name.lower().startswith('torch/')
1.1 If Bfolder is one of the torchvision.datasets classes CIFAR100, CIFAR10, MNIST, QMNIST, KMNIST, FashionMNIST, the corresponding public dataset is loaded
1.2 If Bfolder is imagenet, and split is additionally set to val (validation or test), ImageNet is loaded
1.3 If Bfolder is image_folder or folder, torchvision.datasets.ImageFolder is used to load the dataset. root (the path) is also required and must be a directory (os.path.isdir(root) is checked). root does not need to point at the train/validation level: when split names the purpose and search_split=True, 'root/train' or 'root/training' is searched for automatically, and likewise in evaluation mode
- When name.lower().startswith('tfds/')
- Otherwise, ImageDataset from timm/data/dataset.py is used to load the data; not covered here
split defines the purpose of the dataset; there are two groups
_TRAIN_SYNONYM = {'train', 'training'}
_EVAL_SYNONYM = {'val', 'valid', 'validation', 'eval', 'evaluation'},
and split takes one of these values
So the current task needs the following settings
name='torch/image_folder',
root,
split='validation',
search_split=True,
is_training=False
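The search_split behaviour described above (looking for a sub-folder of root whose name is a synonym of the requested split) can be sketched like this. It is an illustration only: `find_split_dir` is a hypothetical helper and the search order is an assumption, not the library's exact logic:

```python
import os

_TRAIN_SYNONYM = {"train", "training"}
_EVAL_SYNONYM = {"val", "valid", "validation", "eval", "evaluation"}


def find_split_dir(root, split):
    """Return root/<synonym> if such a sub-folder exists, otherwise root itself."""
    split = split.lower()
    if split in _TRAIN_SYNONYM:
        candidates = _TRAIN_SYNONYM
    elif split in _EVAL_SYNONYM:
        candidates = _EVAL_SYNONYM
    else:
        candidates = {split}
    # Try the requested name first, then its synonyms in a stable order.
    for name in [split] + sorted(candidates - {split}):
        path = os.path.join(root, name)
        if os.path.isdir(path):
            return path
    return root
```

For a dataset stored as 'd:/image/train' and 'd:/image/valid', passing root='d:/image' with split='validation' would resolve to the 'valid' sub-folder.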
6.1.4. create_loader in loader.py
The base loader is torch.utils.data.DataLoader, and PrefetchLoader is used by default. When use_multi_epochs_loader = True, MultiEpochsDataLoader is used instead; when prefetcher = True, the loader is wrapped as follows
PrefetchLoader(loader,
mean=mean,
std=std,
channels=input_size[0],
fp16=fp16,
re_prob=prefetch_re_prob,
re_mode=re_mode,
re_count=re_count,
re_num_splits=re_num_splits,  # related to resplit and num_aug_splits
)
Attention
The num_workers argument of torch.utils.data.DataLoader often triggers errno 32 broken pipe. The cause is limited support for multi-process data loading in PyTorch on Windows 10; setting num_workers to 0 resolves it
train_loader = torch.utils.data.DataLoader(trainData, batch_size=40, shuffle=True,
num_workers=0,  # set num_workers to 0 here
)
However, another error is then raised:
ValueError: persistent_workers option needs num_workers > 0
This is because create_loader() defaults to persistent_workers=True. persistent_workers is also an argument of torch.utils.data.DataLoader; its benefit is that worker processes do not have to be shut down and restarted between epochs, which speeds up training. However, persistent_workers=True conflicts with num_workers=0, so call create_loader(persistent_workers=False).
persistent_workers (bool, optional) – If True, the data loader will not shutdown the worker processes after a dataset has been consumed once. This allows to maintain the workers Dataset instances alive. (default: False)
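One way to avoid hitting the two errors above in sequence is to derive persistent_workers from num_workers. A minimal sketch; `safe_loader_kwargs` is a hypothetical helper name, not a timm or PyTorch function:

```python
def safe_loader_kwargs(num_workers, persistent_workers=True):
    """Build DataLoader keyword arguments that never combine
    persistent_workers=True with num_workers=0 (which raises ValueError)."""
    return {
        "num_workers": num_workers,
        "persistent_workers": persistent_workers and num_workers > 0,
    }
```

The result can then be unpacked into the loader call, e.g. torch.utils.data.DataLoader(trainData, batch_size=40, shuffle=True, **safe_loader_kwargs(0)).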
6.1.5. real_labels.py
class RealLabelsImagenet, func add_result(self, output), get_accuracy(self, k=None)
6.1.6. create_transform in transforms_factory.py
If is_training=True and no_aug=True: resize + center crop
If is_training=True and no_aug=False: by default the processing is driven by the auto_augment, hflip, and color_jitter parameters
6.2. loss
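This package includes the asymmetric loss for multi-label classification, which addresses positive/negative sample imbalance and mislabeled samples in multi-label tasks. The method is efficient and easy to use; compared with other recent methods it builds on mainstream network architectures and needs no extra information. Other loss functions are also provided, such as binary_cross_entropy, cross_entropy, and jsd. train.py defaults to LabelSmoothingCrossEntropy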
6.3. models
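Model loading functions, similar to torchvision.models
- models/factory.py/create_model
Parameter | annotation |
---|---|
model_name (str) | name of the model to instantiate |
pretrained (bool) | if True, load pretrained ImageNet-1k weights |
checkpoint_path (str) | path of a checkpoint to load after the model is initialized. When checkpoint_path is non-empty, models/helpers.py/load_checkpoint(model, checkpoint_path, use_ema=False, strict=True) is called: it reads the saved parameter key/value pairs with state_dict = load_state_dict(checkpoint_path, use_ema) and then loads them into the model with model.load_state_dict(state_dict, strict=strict). This is independent of pretrained |
scriptable (bool) | set layer config so that the model is jit scriptable (not working for all models yet) |
exportable (bool) | set layer config so that the model is traceable / ONNX exportable (not fully impl/obeyed yet) |
no_jit (bool) | set layer config so that the model does not use jit scripted layers (so far activations only) |
drop_rate (float) | dropout rate for training (default: 0.0) |
global_pool (str) | global pool type (default: 'avg') |
- models/helpers.py,
2.1. Resuming training: resume_checkpoint(model, checkpoint_path, optimizer=None, loss_scaler=None, log_info=True). When used in train.py, lr_scheduler.step(start_epoch) on line 498 raises an error saying that metric must not be None; a workaround is to modify resume_checkpoint so that it also returns the metric.
2.2. Initializing the model with self-trained weights and training again: load_checkpoint(model, checkpoint_path, use_ema=False, strict=True), where helpers.py/load_state_dict(checkpoint_path, use_ema=False) reads the model parameters saved at checkpoint_path and model.load_state_dict loads them into the model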
/layers/test_time_pool.py
apply_test_time_pool(model, config, use_test_size=True) -> model, test_time_pool(True, false)
6.4. optim
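This package provides a selection of optimizers. An optimizer updates the model weights from the gradients of the loss, with lr controlling the step size; it does not compute the loss itself.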
optim_factory.py
create_optimizer_v2(
model_or_params,
opt: str = 'sgd',
lr: Optional[float] = None,
weight_decay: float = 0.,
momentum: float = 0.9,
filter_bias_and_bn: bool = True,
layer_decay: Optional[float] = None,
param_group_fn: Optional[Callable] = None,
**kwargs):
Create an optimizer.
TODO currently the model is passed in and all parameters are selected for optimization.
For more general use an interface that allows selection of parameters to optimize and lr groups, one of:
* a filter fn interface that further breaks params into groups in a weight_decay compatible fashion
* expose the parameters interface and leave it up to caller
Args:
model_or_params (nn.Module): model containing parameters to optimize
opt: name of optimizer to create
lr: initial learning rate
weight_decay: weight decay to apply in optimizer
momentum: momentum for momentum based optimizers (others may use betas via kwargs)
filter_bias_and_bn: filter out bias, bn and other 1d params from weight decay
**kwargs: extra optimizer specific kwargs to pass through
Returns:
Optimizer
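The effect of filter_bias_and_bn can be illustrated without PyTorch by grouping parameters on their name and shape. This is a sketch under the assumption that "bias, bn and other 1d params" means any parameter with at most one dimension or a name ending in ".bias"; `split_weight_decay_groups` and its arguments are hypothetical names:

```python
def split_weight_decay_groups(named_shapes, weight_decay=2e-5, no_weight_decay_list=()):
    """Split (name, shape) pairs into a no-decay group and a decay group."""
    decay, no_decay = [], []
    for name, shape in named_shapes:
        # 1d tensors (biases, norm scales) are excluded from weight decay.
        if len(shape) <= 1 or name.endswith(".bias") or name in no_weight_decay_list:
            no_decay.append(name)
        else:
            decay.append(name)
    return [
        {"params": no_decay, "weight_decay": 0.0},
        {"params": decay, "weight_decay": weight_decay},
    ]
```

With real modules the same grouping would be applied to model.named_parameters(), and the resulting list passed to the optimizer constructor as parameter groups.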
6.5. scheduler
Mainly handles learning-rate scheduling. Unlike the builtin PyTorch schedulers, this is intended to be consistently called at the END of each epoch, before incrementing the epoch count, to calculate next epoch's value, and at the END of each optimizer update, after incrementing the update count, to calculate next update's value. Accordingly, during training lr_scheduler.step_update(num_updates=num_updates, metric=losses_m.avg) is called after every batch and lr_scheduler.step(epoch + 1, eval_metrics[eval_metric]) after every epoch, which also explains the roles of lr_scheduler.step_update and lr_scheduler.step.
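The calling convention can be made concrete with a toy loop. The scheduler below is a stand-in that only records its calls; `RecordingScheduler` and `run_training` are hypothetical names, and only the order of the step_update and step calls is meant to mirror the description above:

```python
class RecordingScheduler:
    """Toy stand-in for a timm-style scheduler; it only records how it is called."""

    def __init__(self):
        self.calls = []

    def step_update(self, num_updates, metric=None):
        self.calls.append(("step_update", num_updates))

    def step(self, epoch, metric=None):
        self.calls.append(("step", epoch))


def run_training(scheduler, num_epochs, batches_per_epoch):
    num_updates = 0
    for epoch in range(num_epochs):
        for _ in range(batches_per_epoch):
            # ... forward, backward and optimizer.step() would happen here ...
            num_updates += 1  # increment first, then inform the scheduler
            scheduler.step_update(num_updates=num_updates)
        # ... validation would happen here ...
        scheduler.step(epoch + 1)  # called before the loop moves to the next epoch
    return scheduler.calls
```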
scheduler_factory.py defines lr_scheduler, num_epochs = create_scheduler(args, optimizer); the other files implement the learning-rate schedules listed below
lr_scheduler | num_epochs | Underlying method | Usage |
---|---|---|---|
from .cosine_lr import CosineLRScheduler | num_epochs = lr_scheduler.get_cycle_length() + args.cooldown_epochs | the SGDR scheduler also referred to as the cosine scheduler in timm | |
from .multistep_lr import MultiStepLRScheduler | num_epochs = args.epochs | ||
from .plateau_lr import PlateauLRScheduler | num_epochs = args.epochs | This scheduler is very similar to PyTorch's ReduceLROnPlateau scheduler. The basic idea is to track an eval metric and based on the evaluation metric's value, the lr is reduced using StepLR if the eval metric is stagnant for a certain number of epochs. | Decay the LR by a factor every time the validation loss plateaus. The PlateauLRScheduler by default tracks the eval-metric which is by default top-1 in the timm training script. If the performance plateaus, then the new learning rate after a certain number of epochs (by default 10) is set to lr * decay_rate. This scheduler underneath uses PyTorch's ReduceLROnPlateau. |
from .poly_lr import PolyLRScheduler | num_epochs = lr_scheduler.get_cycle_length() + args.cooldown_epochs | ||
from .step_lr import StepLRScheduler | num_epochs = args.epochs | The StepLR is a basic step LR schedule with warmup, noise. PyTorch’s implementation does not support warmup or noise. After a certain number decay_epochs, the learning rate is updated to be lr * decay_rate. | |
from .tanh_lr import TanhLRScheduler | num_epochs = lr_scheduler.get_cycle_length() + args.cooldown_epochs | Stochastic Gradient Descent with Hyperbolic-Tangent Decay on Classification. This is also referred to as tanh annealing; tanh stands for hyperbolic tangent decay. | |
When lr_cycle_limit = 1 and lr_cycle_mul = 1, lr_scheduler.get_cycle_length() = args.epochs.
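For other settings the total length depends on how the cycles grow. A sketch, assuming that each cycle is lr_cycle_mul times longer than the previous one so that the lengths form a geometric series; `cycle_length` and `total_epochs` are hypothetical helper names, not the library functions:

```python
import math


def cycle_length(t_initial, cycle_mul=1.0, cycles=1):
    """Total epochs covered by `cycles` cycles whose lengths grow by `cycle_mul`."""
    cycles = max(1, cycles)
    if cycle_mul == 1.0:
        return t_initial * cycles
    # Geometric series: t_initial * (1 + m + m^2 + ... + m^(cycles - 1))
    return int(math.floor(t_initial * (cycle_mul ** cycles - 1) / (cycle_mul - 1)))


def total_epochs(epochs, cooldown_epochs, cycle_mul=1.0, cycle_limit=1):
    """num_epochs for the cyclic schedulers: cycle length plus cooldown."""
    return cycle_length(epochs, cycle_mul, cycle_limit) + cooldown_epochs
```

With a single cycle and no multiplier this reduces to args.epochs + args.cooldown_epochs, matching the table above.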
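timm's LR schedulers all inherit from timm's own Scheduler base class (timm/scheduler/scheduler.py) rather than from a torch scheduler class, and so provide the methods Scheduler.step and Scheduler.step_update
Args of PlateauLRScheduler in timm compared with ReduceLROnPlateau in PyTorch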
TIMM | PyTorch |
---|---|
patience_t | patience |
decay_rate | factor |
verbose | verbose |
threshold | threshold |
cooldown_t | cooldown |
mode | mode |
lr_min | min_lr |
When args.sched == 'plateau', mode is 'min' only if eval_metric = 'loss'; otherwise mode = 'max'. PlateauLRScheduler itself also defaults to 'max'
PlateauLRScheduler(
optimizer,
decay_rate=0.1,
patience_t=10,
verbose=True,
threshold=1e-4,
cooldown_t=0,
warmup_t=0,
warmup_lr_init=0,
lr_min=0,
mode='max',
noise_range_t=None,
noise_type='normal',
noise_pct=0.67,
noise_std=1.0,
noise_seed=None,
initialize=True,
)
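The interplay of mode, patience_t, decay_rate and lr_min can be sketched with a small tracker. This is a simplified illustration and not the timm or PyTorch implementation: it uses an absolute threshold and ignores warmup, cooldown and noise. `plateau_mode` and `PlateauTracker` are hypothetical names:

```python
def plateau_mode(eval_metric):
    """'min' when the tracked metric is a loss, otherwise 'max'."""
    return "min" if "loss" in eval_metric else "max"


class PlateauTracker:
    """Reduce lr by decay_rate after more than patience_t epochs without improvement."""

    def __init__(self, lr, decay_rate=0.1, patience_t=10, threshold=1e-4,
                 mode="max", lr_min=0.0):
        self.lr = lr
        self.decay_rate = decay_rate
        self.patience_t = patience_t
        self.threshold = threshold
        self.mode = mode
        self.lr_min = lr_min
        self.best = None
        self.bad_epochs = 0

    def step(self, metric):
        if self.best is None:
            improved = True
        elif self.mode == "max":
            improved = metric > self.best + self.threshold
        else:
            improved = metric < self.best - self.threshold
        if improved:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience_t:
                self.lr = max(self.lr * self.decay_rate, self.lr_min)
                self.bad_epochs = 0
        return self.lr
```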
6.6. utils
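This package holds general utility classes and functions that support the training scripts and the model code, such as checkpoint saving, logging, and metrics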
6.6.1. checkpoint_saver.py
class CheckpointSaver(
model=model, optimizer=optimizer, args=args, model_ema=model_ema, amp_scaler=loss_scaler,
checkpoint_dir=output_dir, recovery_dir=output_dir, decreasing=decreasing, max_history=args.checkpoint_hist),
func save_checkpoint, _save, _cleanup_checkpoints, save_recovery, find_recovery
save_checkpoint(self, epoch, metric=None)
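- Over all epochs, at most 10 sets of the best model weights and results end up saved. Entries are added with checkpoint_files.append((save_file_path, metric)); when a better result appears it replaces the worst one kept so far, and the list is re-sorted,
- The list is sorted by metric in ascending or descending order. When the metric is the average loss (eval_metric = 'loss'), decreasing=True and the sort uses reverse=not decreasing, so the 10 checkpoints with the lowest loss are kept and recorded in the log
- The file names follow one convention: 'checkpoint' + str(epoch) + '.pth.tar'
- return (None, None) if self.best_metric is None else (self.best_metric, self.best_epoch); it is unclear whether this refers to the best metric among the 10 kept results and its epoch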
_save(self, save_path, epoch, metric=None)
Uses torch.save() and records
epoch
model
args
get_state_dict(self.model, self.unwrap_fn),
optimizer.state_dict(),
‘version’
amp_scaler.state_dict()
get_state_dict(self.model_ema, self.unwrap_fn)
metric
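_cleanup_checkpoints
Defines the rule for deleting surplus checkpoints: after checkpoint_files is sorted, every entry with index >= 10 is deleted
save_recovery
Saves the model weights and results after a given batch of a given epoch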
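The keep-the-best-N rule used by save_checkpoint and _cleanup_checkpoints can be sketched on plain (path, metric) tuples. `keep_best` is a hypothetical helper that only mirrors the sorting and trimming described above; it does not touch the file system:

```python
def keep_best(checkpoint_files, new_entry, max_history=10, decreasing=False):
    """Add new_entry to the (path, metric) list and split it into kept and removed.

    decreasing=True means lower is better (e.g. loss), so the list is sorted
    ascending; otherwise higher is better and it is sorted descending.
    """
    files = list(checkpoint_files) + [new_entry]
    files.sort(key=lambda entry: entry[1], reverse=not decreasing)
    return files[:max_history], files[max_history:]
```

The first element of the kept list is the best checkpoint so far; the removed entries are the ones whose files would be deleted.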
6.6.2. summary.py
- get_outdir(path, *paths, inc=False) sets output_dir:
with inc=True, the folder name is determined by a counter;
with inc=False, the path path/*paths is created if it does not exist, otherwise it is used as is;
in train.py, *paths = experiment when experiment is not the default ''
- update_summary(epoch, train_metrics, eval_metrics, filename, write_header=False, log_wandb=False) writes the train and validation results to a csv
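The get_outdir behaviour can be sketched as below. This is an illustration of the description above, not the library code; `get_outdir_sketch` is a hypothetical name and the "-N" suffix format for the counter is an assumption:

```python
import os


def get_outdir_sketch(path, *paths, inc=False):
    """Create (or reuse) an output directory; with inc=True never reuse an existing one."""
    outdir = os.path.join(path, *paths)
    if not os.path.exists(outdir):
        os.makedirs(outdir)
    elif inc:
        # Directory already exists: append the first free counter value.
        count = 1
        while os.path.exists(f"{outdir}-{count}"):
            count += 1
        outdir = f"{outdir}-{count}"
        os.makedirs(outdir)
    return outdir
```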
6.6.3. log.py
setup_default_logging(default_level=logging.INFO, log_path=‘’)
6.6.4. metrics.py
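Defines class AverageMeter and the function accuracy
accuracy(output, target, topk=(1,))
Arguments:
output is the matrix of prediction scores, of size batch_size × num_labels,
target is the vector of ground-truth labels, of size 1 × batch_size
topk is usually set to (1, n), meaning that at most the n largest values per row of the prediction matrix are taken. n > num_labels is harmless because maxk = min(max(topk), output.size()[1]) handles it: the function calls
_, pred = output.topk(maxk, 1, True, True), so in practice at most the maxk largest predictions are taken. Here "_" holds the predicted scores and pred holds the indices, i.e. the predicted labels
Returns:
The percentage of samples in the batch whose highest-scoring index equals the correct label, and the percentage of samples in the batch whose maxk highest-scoring indices contain the correct label.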
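The same top-k computation can be written without tensors, which makes the role of maxk easy to check by hand. A sketch on plain lists; `topk_accuracy` is a hypothetical name and not the timm function:

```python
def topk_accuracy(output, target, topk=(1,)):
    """output: list of per-sample score lists; target: list of int labels.

    Returns one percentage per k in topk.
    """
    num_classes = len(output[0])
    maxk = min(max(topk), num_classes)  # never ask for more classes than exist
    # Indices of the maxk highest scores per sample, best first.
    pred = [
        sorted(range(num_classes), key=lambda c: row[c], reverse=True)[:maxk]
        for row in output
    ]
    results = []
    for k in topk:
        k = min(k, maxk)
        correct = sum(1 for p, t in zip(pred, target) if t in p[:k])
        results.append(100.0 * correct / len(target))
    return results
```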
6.6.5. misc.py
natural_key(string_) provides the key for sorted() when sorting file names
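A natural sort key treats runs of digits as numbers so that "img2" sorts before "img10". A minimal sketch of one common way to build such a key; it is written from scratch here and may differ in detail from the function in misc.py:

```python
import re


def natural_key(string_):
    """Split into text and digit runs, turning the digit runs into ints."""
    return [int(s) if s.isdigit() else s for s in re.split(r"(\d+)", string_.lower())]
```

It is used as sorted(filenames, key=natural_key).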
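7. avg_checkpoints.py
Averages the model weights of all checkpoints under a given path that match a given filter wildcard. For good results, the checkpoints must come from the same training run.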
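The two steps, selecting files with a wildcard and averaging the weights key by key, can be sketched as follows. Plain floats stand in for tensors, and `select_checkpoints` and `average_state_dicts` are hypothetical names; loading the files is left out:

```python
import fnmatch


def select_checkpoints(filenames, pattern="*.pth.tar"):
    """Keep only the file names that match the wildcard pattern."""
    return sorted(f for f in filenames if fnmatch.fnmatch(f, pattern))


def average_state_dicts(state_dicts):
    """Average values key by key; all dicts are assumed to share the same keys."""
    count = len(state_dicts)
    return {key: sum(sd[key] for sd in state_dicts) / count for key in state_dicts[0]}
```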
8. benchmark.py
A script for benchmarking inference and training steps of timm models.
9. inference.py
An example inference script that writes the top-k class ids for the images in a folder to a csv.
10. train.py
This is a lean and easily modifiable ImageNet training script that reproduces ImageNet training results with some of the latest networks and training techniques. It favours canonical PyTorch and standard Python style over trying to "do it all". That said, it offers quite a few speed and training-result improvements over the usual PyTorch example scripts, and it can be repurposed as you see fit.
Parameters
These arguments define Dataset/Model parameters, Optimizer parameters, Learning Rate scheduler parameters, Augmentation and regularization, Batch Norm parameters, Model exponential moving average parameters, and some miscellaneous parameters such as --seed, --tta etc.
Do note that some random augmentations are set by default, such as color_jitter and hflip, but there is a parameter no-aug in case you want to turn off all training data augmentations. Also, the default optimizer opt is 'sgd' but it is possible to change that. timm offers a vast number of optimizers to train your models with.
Argument | Description |
---|---|
--aa | Auto-Augment |
Running
Distributed Training on multiple GPUs
To train models on multiple GPUs, simply replace python train.py with ./distributed_train.sh like so:
./distributed_train.sh 4 ./imagenette2-320 --aug-splits 3 --jsd
This trains the model using AugMix data augmentation on 4 GPUs.
step and args
- loss function
args.jsd_loss, args.aug_splits, args.smoothing
args.mixup, args.cutmix, args.cutmix_minmax, args.bce_loss, args.bce_target_thresh
- distributed computing
args.local_rank: gpu_id, an int variable; it can only name a single GPU, default 0
It has a second role: when args.local_rank == 0, the log is written and, sometimes, the training batch images are saved
- train
- Whether to use wandb: wandb must be installed first so that has_wandb = True; args.log_wandb is also involved and defaults to False
- augmentation, mixup, cutmix
- Attention
When mixed precision is used, train_one_epoch and validate raise
Error: Input type (torch.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same
because the input has not been moved to the GPU. If args.prefetcher = False, change
if not args.prefetcher:
to
if args.prefetcher:
11. validate.py
A lean and easily modifiable ImageNet validation script; its role is similar to that of train.py.
11.1. Parameters
name | annotation | default or example | relation |
---|---|---|---|
valid-labels | Path to a .txt file listing some or all of the label indices to validate, one per line. The contents are assigned to the variable valid_labels and applied as output = model(input); output = output[:, valid_labels] | default='' | |
real-labels | Path to the Real labels JSON file for ImageNet evaluation | default='' | |
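The valid-labels mechanism amounts to reading one index per line and keeping only those columns of the model output. A tensor-free sketch; `load_valid_labels` and `select_columns` are hypothetical names, and the list comprehension plays the role of output[:, valid_labels]:

```python
def load_valid_labels(path):
    """Read one label index per line, skipping blank lines."""
    with open(path) as f:
        return [int(line) for line in f if line.strip()]


def select_columns(output, valid_labels):
    """Pure-Python equivalent of output[:, valid_labels] for a list of rows."""
    return [[row[i] for i in valid_labels] for row in output]
```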
notice
To use from dataset_factory_joyce import create_dataset_joyce, create_loader must be called with use_prefetcher=args.no_prefetcher=False
11.2. debug
In validate.py, from timm.utils import set_jit_fuser raises an error; changing it to from timm.utils import * makes the script run normally
Q
- With distribute = TRUE, device is still just a single cuda:local_rank; how does that make it distributed?
- wandb is not working yet
Resolved
3. What is cuDNN?