Image Classification: A Walkthrough of the pytorch-image-models-master Code Directory

References

https://fastai.github.io/timmdocs/training_modelEMA
https://blog.csdn.net/weixin_44396553/article/details/120901765
https://github.com/rwightman/pytorch-image-models
https://fastai.github.io/timmdocs/training
broken pipe https://blog.csdn.net/qq_26369907/article/details/99701006
https://www.cnblogs.com/jiangkejie/p/14965003.html
Pytorch Image Models (a.k.a. timm) has a lot of pretrained models and an interface which allows using these models as encoders in smp; however, not all models are supported:
transformer models do not have the features_only functionality implemented
some models do not have appropriate strides

https://zhuanlan.zhihu.com/p/469323798

Terminology

Abbreviation | Full name | Explanation
AdvProp | Adversarial Propagation | AdvProp is an adversarial training scheme which treats adversarial examples as additional examples, to prevent overfitting. It learns from both clean images and adversarially modified images; since clean images come from a different distribution than adversarial images, the model also needs to use batch statistics according to the image source, otherwise it may not extract accurate features.
https://paperswithcode.com/paper/adversarial-examples-improve-image#code
https://resbyte.github.io/posts/2020/06/Adversarial-Prop/
BatchNorm (BN) | Batch Normalization | Batch normalization is a technique for training very deep neural networks that standardizes the inputs to a layer for each mini-batch. It can be implemented during training by calculating the mean and standard deviation of each input variable to a layer per mini-batch and using these statistics to perform the standardization. Alternately, a running average of the mean and standard deviation can be maintained across mini-batches, but this may result in unstable training.
https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
https://keras.io/api/layers/normalization_layers/batch_normalization/
AugMix | | AugMix is a data processing technique that mixes augmented images and enforces consistent embeddings of the augmented images, which results in increased robustness and improved uncertainty calibration. timm also supports AugMix together with RandAugment and AutoAugment.
https://github.com/google-research/augmix
SyncBN | Synchronized Batch Normalization | SyncBN is a type of batch normalization used for multi-GPU training. Standard batch normalization only normalizes the data within each device (GPU); SyncBN normalizes the input across the whole mini-batch.
SGDR | Stochastic Gradient Descent with Warm Restarts |

File overview

The folder tree structure is as follows:
|— convert
| |— convert_from_mxnet
| |— convert_nest_flax
|— docs
| |— models
| | |— efficient.md
| | | …
| |— archived_changes.md
| |— changes.md
| |— index.md
| |— feature_extraction.md
| |— models.md
| |— results.md
| |— scripts.md
| |—training_hparam_examples.md
|— notebooks
|— results
|— tests
|— timm
| |— data
| | |— config.py : func_resolve_data_config()
| | |— dataset_factory.py
| | |— loader.py
| |— loss
| |— models
| |— optim
| | |— optim_factory.py
| |— scheduler
| | |— scheduler_factory.py
| |— utils
| | |— checkpoint_saver.py
| | |— metrics.py
|— avg_checkpoints.py
|— benchmark.py
|— clean_checkpoint.py
|— distributed_train.sh # Change the Python interpreter to Python 3.x in the scripts
|— train.py

Parameters

name | annotation | default or example | relation
data_dir | path of the custom dataset, used by create_dataset | e.g. 'd:/image' (containing 'd:/image/train' and 'd:/image/valid') | create_dataset
dataset | used by create_dataset; set as 'Afolder/Bfolder', which defines how and what data is fetched | 'torch/image_folder' | create_dataset
train-split | used by create_dataset, assigned to the split argument; marks the dataset as the training set | 'train' | create_dataset
val-split | used by create_dataset, assigned to the split argument; marks the dataset as the validation set | 'validation' | create_dataset
class-map | not needed when dataset = 'torch/…' or 'tfds/…'; class index via a text file or dict | | create_dataset
dataset_download | used by create_dataset; set it when a specified public PyTorch dataset needs to be downloaded | False | create_dataset
input_size | image size, user-defined input such as '1' '224' '224', parsed into a list or tuple, e.g. [3, 224, 224] | None | create_loader
img_size | image size, int. In /timm/data/config.py resolve_data_config, input_size first takes the user-defined input_size, then input_size = (3, args['img_size'], args['img_size']), and finally falls back to the model's default_cfg input_size | None | create_loader
batch_size | train batch size | 128 | create_loader
validation-batch-size | validation batch size | default=None | create_loader
crop_pct | float, input image center crop percent (for validation only); resolve_data_config falls back to the value defined in the model config | default=None | create_loader
mean | override mean pixel value of dataset; used by create_loader in loader.py; resolve_data_config falls back to the value defined in the model config | default=None | create_loader
std | override std deviation of dataset; used by create_loader in loader.py; resolve_data_config falls back to the value defined in the model config | default=None | create_loader
interpolation | image resize interpolation type (overrides model); resolve_data_config falls back to the value defined in the model config | default='' | create_loader
train-interpolation | training interpolation (random, bilinear, bicubic; default: "random"); sometimes falls back to args.interpolation | 'random' | create_loader
aug-splits | number of augmentation splits (default: 0, valid: 0 or >=2); assigned to num_aug_splits in train.py. With num_aug_splits=2, loader_train holds the first 8 original images followed by 8 images representing augmix1; had we passed num_aug_splits=3, the effective batch size would be 24, with 8 original images, 8 representing augmix1 and 8 representing augmix2 | default=0 | loss, augmentation
local_rank | mainly used for writing logs | default=0 |
use-multi-epochs-loader | use /timm/data/loader.py MultiEpochsDataLoader to save time at the beginning of every epoch | default=False |
log-wandb | log training and validation metrics to wandb | default=False |
no-prefetcher | in train.py, prefetcher = not no-prefetcher | default=False (setting it disables the fast prefetcher) |
no_aug | when True, disable all training augmentation | False | create_loader
reprob | random erase prob; in loader.py create_loader it becomes reprob if is_training and not no_aug else 0. | default=0 |
remode | random erase mode; loader.py create_loader | default='pixel' |
recount | random erase count; loader.py create_loader | default=1 |
resplit | do not random erase the first (clean) augmentation split; loader.py create_loader | default=False |
mixup | mixup alpha, mixup enabled if > 0 | default=0.0 | loss, augmentation
cutmix | cutmix alpha, cutmix enabled if > 0 | default=0.0 | loss, augmentation
cutmix-minmax | cutmix min/max ratio, overrides alpha and enables cutmix if set | default=None | loss, augmentation
smoothing | label smoothing | default=0.1 | loss, augmentation
jsd-loss | enable Jensen-Shannon Divergence + CE loss; use with aug-splits | default=False | loss
bce-loss | enable BCE loss w/ Mixup/CutMix use | default=False | loss
bce-target-thresh | threshold for binarizing softened BCE targets (default: None, disabled) | default=None | loss
mixup-off-epoch | turn off mixup after this epoch, disabled if 0 | default=0 | train
log_interval | how many batches to wait before logging training status; must not be 0; defines how often (every N batches) a log line is written and the metric averages are computed | default=50 |
save_images | save images of input batches every log interval for debugging | default=False |
recovery_interval | how many batches to wait before writing a recovery checkpoint | default=0 |
amp | use NVIDIA Apex AMP or Native AMP for mixed precision training | default=False | mixed precision training
apex-amp | use NVIDIA Apex AMP mixed precision | default=False | mixed precision training
native-amp | use Native Torch AMP mixed precision | default=False | mixed precision training
checkpoint-hist | number of checkpoints to keep | default=10 | saver
eval-metric | key of one of the evaluation metrics in the output of validate(); decreasing = True if eval_metric == 'loss' else False; decreasing is used in the saver to decide the sort order (whether to sort in reverse), so the chosen metric must be contained in the validate output | default='top1' | saver
experiment | 1. project name created when using wandb; 2. name of the sub-folder under the training output path | default='' | wandb
model-ema | enable tracking a moving average of the model weights (whether to use EMA) | default=False | ema
model-ema-force-cpu | force EMA to be tracked on CPU, rank=0 node only; disables EMA validation | default=False | ema
model-ema-decay | decay factor for the model weights moving average (default: 0.9998) | default=0.9998 | ema
output | path to output folder (default: none, current dir); related to the variable output_dir, which is output or './output/train', plus timestamp + experiment + image width | default='' |
channels-last | use channels_last memory layout | default=False | model
model | name of model to train | default='resnet50' | model
pretrained | start with the pretrained version of the specified network (if available); corresponds to create_model's pretrained | default=False | model
initial-checkpoint | full file path; initialize the model from this checkpoint; corresponds to create_model's checkpoint_path | default='' | model
num-classes | number of label classes; the model must have a num_classes attr if not set on the cmd line/config | default=None | model
epochs | int, number of epochs to train | 300 | train
epoch-repeats | float, epoch repeat multiplier (number of times to repeat dataset epoch per train epoch) | 0. | train
start-epoch | int, manual epoch number (useful on restarts); if None it is set to 0, or to the resumed epoch + 1 | default=None | resume training
resume | full file path; resume full model and optimizer state from a checkpoint (helpers.py resume_checkpoint); similar to initial-checkpoint | default='' | resume training
no-resume-opt | prevent resume of optimizer state when resuming model | False | resume training
torchscript | convert the model to torchscript for inference | | model
drop | dropout rate | default=0.0 | model
gp | global pool type, one of (fast, avg, max, avgmax, avgmaxc); model default if None | default=None | model
opt | optimizer | default='sgd' | opt
opt-eps | float, optimizer epsilon | default=None | opt
opt-betas | float, optimizer betas | default=None | opt
lr | learning rate | default=0.05 | opt, learning rate schedule
momentum | optimizer momentum | default=0.9 | opt
weight-decay | weight decay | default=2e-5 | opt
clip-grad | clip gradient norm (default: None, no clipping) | default=None | opt
clip-mode | gradient clipping mode, one of ("norm", "value", "agc") | default='norm' | opt
sched | str, LR scheduler (how the learning rate decays) | 'cosine' | create_scheduler
decay-rate | float, LR decay rate | 0.1 | create_scheduler
warmup-lr | float, warmup learning rate; the LR rises from this value before the decay begins | 0.0001 | create_scheduler
min-lr | float, lower lr bound for cyclic schedulers that hit 0 (1e-5) | 1e-6 | create_scheduler
decay-epochs | float, epoch interval to decay LR | 100 | create_scheduler
warmup-epochs | int, epochs to warmup LR, if the scheduler supports it | 3 | create_scheduler
cooldown-epochs | int, epochs to cooldown LR at min_lr, after the cyclic schedule ends | 10 | create_scheduler
patience-epochs | int, patience epochs for the Plateau LR scheduler; if the loss does not decrease within this many epochs, the LR is reduced | 10 | create_scheduler
lr-noise | float, learning rate noise on/off epoch percentages | default=None | create_scheduler
lr-noise-pct | float, learning rate noise limit percent | 0.67 | create_scheduler
lr-noise-std | float, learning rate noise std-dev | 1.0 | create_scheduler
lr-cycle-mul | float, learning rate cycle len multiplier | 1.0 | create_scheduler
lr-cycle-decay | float, amount to decay each learning rate cycle | 0.5 | create_scheduler
lr-cycle-limit | int, learning rate cycle limit, cycles enabled if > 1 | 1 | create_scheduler
lr-k-decay | float, learning rate k-decay for cosine/poly | 1.0 | create_scheduler
tta | test/inference time augmentation (oversampling) factor; 0=None (default: 0) | default=0 |
checkpoint_hist | int, number of checkpoints to keep; after all epochs, only the last N best sets of model parameters and results are kept | default=10 |
(variables that need to be added manually:)
distributed | whether distributed training is enabled | default=False | distribute
world_size | when distributed, = torch.distributed.get_world_size(), i.e. the total number of processes, 1 GPU per process | default=1 | distribute
rank | when distributed, = torch.distributed.get_rank() | default=1 | distribute
hflip | training set transforms.RandomHorizontalFlip | 0.5 | create_loader
vflip | training set transforms.RandomVerticalFlip | 0 | create_loader
color_jitter | training set transforms.ColorJitter | 0.4 | create_loader

1. convert

This folder mainly contains converters for mxnet and flax; they convert pretrained models from MXNet and NesT/Flax into PyTorch models, since converting pretrained models between different deep learning frameworks is a common task.

2. docs

This folder mainly contains .md documentation files, covering each model's parameters, usage steps, requirements, code source, and Top-1/Top-5 accuracy on ImageNet. If you need to use a particular model, you can look up the relevant information in the docs folder.
2.1. index.md: getting started with timm
2.2. feature_extraction.md
2.3. models.md: reference papers for the various models
2.4. results.md: accuracy of the various models on ImageNet
2.5. scripts.md and training_hparam_examples.md: examples of invoking the scripts with specific parameter settings
2.6. docs/models: per-model code examples showing how to load and use each model

3. notebooks

Jupyter notebooks with some of the author's code and notes for reproducing models.

4. results

This folder mainly holds result files.

5. tests

This folder contains test programs, including tests for layers, models, optimizers, and utility classes. You can use it to test your own model's parameters and find the best-performing configuration.

6. timm

6.1. data

Mainly contains the image/dataset configuration; data can be loaded from directories, tar archives, and so on. It also contains image preprocessing operations such as auto-augmentation and the transforms pipeline, which help the network learn more accurate weights and recognize better. (The Transformer architecture, by the way, originally came from NLP; thanks to its self-attention mechanism it also works very well for image processing, which is why it has become so popular in deep learning.)

6.1.1. auto_augment.py

Mainly uses PIL.

6.1.2. config.py

resolve_data_config(args, default_cfg={}, model=None, use_test_size=False, verbose=False)
Generates new_config = {}, containing the input/image size (default (3, 224, 224)), the interpolation method, the mean and std deviation for normalization, and the crop percentage (the center-crop ratio for validation images).
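
A minimal sketch (my own example, not code from the repo) of resolving the data config for a model and building the matching transform; 'resnet50' is just an example model name.

import timm
from timm.data import resolve_data_config
from timm.data.transforms_factory import create_transform

model = timm.create_model('resnet50', pretrained=False)
config = resolve_data_config({}, model=model)   # {} means no command-line overrides
print(config)   # input_size, interpolation, mean, std, crop_pct, ...
transform = create_transform(**config)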

6.1.3. dataset_factory.py 的 create_dataset

Arguments:
name,
root,
split='validation',
search_split=True,
class_map=None,
load_bytes=False,
is_training=False,
download=False,
batch_size=None,
repeats=0,
**kwargs

name is set as 'Afolder/Bfolder' and defines how and what data is fetched:

  1. When name.lower().startswith('torch/'):
    1.1 When Bfolder is one of CIFAR100, CIFAR10, MNIST, QMNIST, KMNIST, FashionMNIST (from torchvision.datasets), the corresponding public dataset is used.
    1.2 When Bfolder is imagenet, and split is val (validation/test), ImageNet is obtained.
    1.3 When Bfolder is image_folder or folder, torchvision.datasets.ImageFolder is used; root (the path) is also required and must be a directory (os.path.isdir(root) is checked). The path does not need to go down to the split level: when split specifies the purpose and search_split=True, it automatically looks for 'root/train' or 'root/training' (and likewise for the evaluation split).
  2. When name.lower().startswith('tfds/'): TensorFlow Datasets are used.
  3. Otherwise, ImageDataset from timm/data/dataset.py is used to read the data; not covered here.

split defines what the dataset is used for; it takes one value from
_TRAIN_SYNONYM = {'train', 'training'}
_EVAL_SYNONYM = {'val', 'valid', 'validation', 'eval', 'evaluation'}

So for the current task, the following parameter settings are needed (see the sketch below):
name='torch/image_folder',
root,
split='validation',
search_split=True,
is_training=False
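
A minimal sketch of a create_dataset call with these settings; the d:/image path is only an example and must exist on disk.

from timm.data import create_dataset

dataset_eval = create_dataset(
    name='torch/image_folder',   # the 'Afolder/Bfolder' form described above
    root='d:/image',             # example path containing d:/image/train and d:/image/valid
    split='validation',          # any of the _EVAL_SYNONYM values works
    search_split=True,           # automatically look for root/val, root/validation, ...
    is_training=False,
)
print(len(dataset_eval))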

6.1.4. loader.py 的 create_loader

The base loader is torch.utils.data.DataLoader. When use_multi_epochs_loader = True it is wrapped in MultiEpochsDataLoader, and when prefetcher = True (the default) it is wrapped in
PrefetchLoader(loader,
    mean=mean,
    std=std,
    channels=input_size[0],
    fp16=fp16,
    re_prob=prefetch_re_prob,
    re_mode=re_mode,
    re_count=re_count,
    re_num_splits=re_num_splits,  # related to resplit and num_aug_splits
)
Attention
The num_workers argument of torch.utils.data.DataLoader easily triggers errno 32 broken pipe; this is a bug caused by PyTorch not yet supporting multi-process data loading properly on Windows 10. Setting num_workers to 0 resolves it:

train_loader = torch.utils.data.DataLoader(trainData, batch_size=40, shuffle=True,
    num_workers=0,  # set num_workers to 0 here
)

However, this then raises another error:

ValueError: persistent_workers option needs num_workers > 0

This is because create_loader() defaults to persistent_workers=True; persistent_workers is also passed through to torch.utils.data.DataLoader. Its benefit is that worker processes do not have to be shut down and restarted between epochs, which speeds up training, but persistent_workers=True conflicts with num_workers=0. So call create_loader(persistent_workers=False).

persistent_workers (bool, optional) – If True, the data loader will not shut down the worker processes after a dataset has been consumed once. This allows the workers' Dataset instances to stay alive and continue loading data for the next epoch. (default: False)
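
A minimal sketch of a Windows-friendly create_loader call based on the notes above; dataset_eval is the dataset from the create_dataset example earlier, and the other values are examples.

from timm.data import create_loader

loader_eval = create_loader(
    dataset_eval,
    input_size=(3, 224, 224),
    batch_size=128,
    is_training=False,
    use_prefetcher=True,
    num_workers=0,              # avoids the errno 32 broken pipe issue on Windows
    persistent_workers=False,   # required when num_workers == 0
)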

6.1.5. real_labels.py

class RealLabelsImagenet, with methods add_result(self, output) and get_accuracy(self, k=None)

6.1.6. transforms_factory.py 的 create_transform

If is_training=True and no_aug=True: resize + center crop only.
If is_training=True and no_aug=False: by default the processing depends on the auto_augment, hflip, and color_jitter arguments.
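
A minimal sketch of building a training transform directly; the values mirror the defaults in the parameter table above, and the auto_augment policy string is just an example.

from timm.data.transforms_factory import create_transform

train_transform = create_transform(
    input_size=224,
    is_training=True,
    no_aug=False,
    hflip=0.5,
    vflip=0.0,
    color_jitter=0.4,
    auto_augment='rand-m9-mstd0.5',   # RandAugment policy string
    interpolation='random',
)
print(train_transform)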

6.2. loss

The package contains, among others, an asymmetric loss for multi-label classification, which addresses the positive/negative sample imbalance and label-error problems in multi-label tasks; the method is efficient and easy to use, and, compared with other recent methods, it works with mainstream network architectures and needs no extra information. There are of course other loss functions as well, such as binary_cross_entropy, cross_entropy, jsd, etc. train.py defaults to LabelSmoothingCrossEntropy. A rough sketch of how the loss is picked follows.
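
This is a simplified sketch of the selection logic, not the exact train.py code; the build_loss helper and its defaults are my own illustration.

import torch.nn as nn
from timm.loss import JsdCrossEntropy, SoftTargetCrossEntropy, LabelSmoothingCrossEntropy

def build_loss(jsd_loss=False, num_aug_splits=0, mixup_active=False, smoothing=0.1):
    if jsd_loss:
        # requires --aug-splits >= 2
        return JsdCrossEntropy(num_splits=num_aug_splits, smoothing=smoothing)
    if mixup_active:
        # mixup/cutmix produce soft targets
        return SoftTargetCrossEntropy()
    if smoothing:
        return LabelSmoothingCrossEntropy(smoothing=smoothing)
    return nn.CrossEntropyLoss()

train_loss_fn = build_loss()   # the defaults give LabelSmoothingCrossEntropy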

6.3. models

Functions for loading models, similar to torchvision.models.

  1. models/factory.py/create_model
Argument | Annotation
model_name (str) | name of the model to instantiate
pretrained (bool) | if True, load the pretrained ImageNet-1k weights
checkpoint_path (str) | path of a checkpoint to load after the model is initialized. If checkpoint_path is non-empty, models/helpers.py load_checkpoint(model, checkpoint_path, use_ema=False, strict=True) is called; it reads the saved parameters via state_dict = load_state_dict(checkpoint_path, use_ema) and loads them into the model with model.load_state_dict(state_dict, strict=strict). This is independent of pretrained.
scriptable (bool) | set layer config so that the model is jit scriptable (not working for all models yet)
exportable (bool) | set layer config so that the model is traceable / ONNX exportable (not fully implemented/obeyed yet)
no_jit (bool) | set layer config so that the model doesn't use jit scripted layers (so far activations only)
drop_rate (float) | dropout rate for training (default: 0.0)
global_pool (str) | global pool type (default: 'avg')
  2. models/helpers.py
    2.1. Resume training: resume_checkpoint(model, checkpoint_path, optimizer=None, loss_scaler=None, log_info=True). When used in train.py, line 498's lr_scheduler.step(start_epoch) raises an error that "metric" must not be None; you can modify resume_checkpoint so that it also returns the metric attribute.
    2.2. Initialize the model with parameters you trained yourself and retrain: load_checkpoint(model, checkpoint_path, use_ema=False, strict=True), where helpers.py load_state_dict(checkpoint_path, use_ema=False) reads the model parameters saved at checkpoint_path and model.load_state_dict loads them into the model. See the sketch below.
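
A minimal sketch of the two initialization paths described above; the model name is an example and the checkpoint path is a placeholder for a file you trained yourself.

import timm
from timm.models import load_checkpoint

# 1) start from the pretrained ImageNet-1k weights (downloads them if needed)
model = timm.create_model('resnet50', pretrained=True, num_classes=10)

# 2) or initialize from your own checkpoint, independently of pretrained
model = timm.create_model('resnet50', num_classes=10)
load_checkpoint(model, 'path/to/checkpoint.pth.tar', use_ema=False, strict=True)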

/layers/test_time_pool.py

apply_test_time_pool(model, config, use_test_size=True) -> model, test_time_pool (True/False)

6.4. optim

The package offers a selection of optimizers; an optimizer's job is to update the model weights step by step given the learning rate and the gradients.

optim_factory.py

create_optimizer_v2(
    model_or_params,
    opt: str = 'sgd',
    lr: Optional[float] = None,
    weight_decay: float = 0.,
    momentum: float = 0.9,
    filter_bias_and_bn: bool = True,
    layer_decay: Optional[float] = None,
    param_group_fn: Optional[Callable] = None,
    **kwargs):
Create an optimizer.
TODO currently the model is passed in and all parameters are selected for optimization.
For more general use an interface that allows selection of parameters to optimize and lr groups, one of:
* a filter fn interface that further breaks params into groups in a weight_decay compatible fashion
* expose the parameters interface and leave it up to caller
Args:
model_or_params (nn.Module): model containing parameters to optimize
opt: name of optimizer to create
lr: initial learning rate
weight_decay: weight decay to apply in optimizer
momentum: momentum for momentum based optimizers (others may use betas via kwargs)
filter_bias_and_bn: filter out bias, bn and other 1d params from weight decay
**kwargs: extra optimizer specific kwargs to pass through
Returns:
Optimizer
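
A minimal sketch of creating the default SGD optimizer the way train.py does; the tiny model is just a stand-in, and the values mirror the defaults in the parameter table.

import torch.nn as nn
from timm.optim import create_optimizer_v2

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
# filter_bias_and_bn=True (the default) keeps bias/BN parameters out of weight decay
optimizer = create_optimizer_v2(
    model,
    opt='sgd',
    lr=0.05,
    weight_decay=2e-5,
    momentum=0.9,
)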

6.5. scheduler

Mainly learning-rate scheduling. Unlike the built-in PyTorch schedulers, these schedulers are intended to be consistently called at the END of each epoch, before incrementing the epoch count, to calculate the next epoch's value, and at the END of each optimizer update, after incrementing the update count, to calculate the next update's value. That is why, in training, each batch ends with lr_scheduler.step_update(num_updates=num_updates, metric=losses_m.avg) and each epoch ends with lr_scheduler.step(epoch + 1, eval_metrics[eval_metric]). This also explains the respective roles of lr_scheduler.step_update and lr_scheduler.step; a rough sketch follows.
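A rough sketch of where step_update() and step() sit, using a CosineLRScheduler constructed directly instead of create_scheduler(args, optimizer); the model, optimizer and all values are examples.

import torch
from timm.scheduler import CosineLRScheduler

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
scheduler = CosineLRScheduler(optimizer, t_initial=300, lr_min=1e-6,
                              warmup_t=3, warmup_lr_init=1e-4)

updates_per_epoch = 100
num_updates = 0
for epoch in range(300):
    for _ in range(updates_per_epoch):
        # ... forward / backward / optimizer.step() go here ...
        num_updates += 1
        scheduler.step_update(num_updates=num_updates)   # end of every optimizer update
    scheduler.step(epoch + 1)                            # end of every epoch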

scheduler_factory.py defines lr_scheduler, num_epochs = create_scheduler(args, optimizer); the available learning-rate schedules are the following:

lr_scheduler | num_epochs | Underlying behaviour
CosineLRScheduler (from .cosine_lr) | num_epochs = lr_scheduler.get_cycle_length() + args.cooldown_epochs | the SGDR scheduler, also referred to as the cosine scheduler in timm
MultiStepLRScheduler (from .multistep_lr) | num_epochs = args.epochs |
PlateauLRScheduler (from .plateau_lr) | num_epochs = args.epochs | very similar to PyTorch's ReduceLROnPlateau scheduler: the basic idea is to track an eval metric and, if that metric is stagnant for a certain number of epochs, reduce the LR as StepLR would, i.e. decay the LR by a factor every time the validation metric plateaus. By default it tracks eval-metric, which is top-1 in the timm training script; if performance plateaus, after a certain number of epochs (10 by default) the new learning rate is set to lr * decay_rate. Underneath, this scheduler uses PyTorch's ReduceLROnPlateau.
PolyLRScheduler (from .poly_lr) | num_epochs = lr_scheduler.get_cycle_length() + args.cooldown_epochs |
StepLRScheduler (from .step_lr) | num_epochs = args.epochs | a basic step LR schedule with warmup and noise; PyTorch's implementation supports neither warmup nor noise. After decay_epochs, the learning rate is updated to lr * decay_rate.
TanhLRScheduler (from .tanh_lr) | num_epochs = lr_scheduler.get_cycle_length() + args.cooldown_epochs | "Stochastic Gradient Descent with Hyperbolic-Tangent Decay on Classification"; also referred to as tanh annealing, where tanh stands for hyperbolic tangent decay.

When lr_cycle_limit = 1 and lr_cycle_mul = 1, lr_scheduler.get_cycle_length() = args.epochs.
timm's LR schedulers all inherit from timm's own Scheduler base class, so they all provide Scheduler.step and Scheduler.step_update.

Arguments of PlateauLRScheduler in timm contrasted with PyTorch's ReduceLROnPlateau

timm | PyTorch
patience_t | patience
decay_rate | factor
verbose | verbose
threshold | threshold
cooldown_t | cooldown
mode | mode
lr_min | min_lr

When args.sched == 'plateau', mode is "min" only if eval_metric == "loss"; otherwise mode = "max". PlateauLRScheduler also defaults to "max".

PlateauLRScheduler(
    optimizer,
    decay_rate=0.1,
    patience_t=10,
    verbose=True,
    threshold=1e-4,
    cooldown_t=0,
    warmup_t=0,
    warmup_lr_init=0,
    lr_min=0,
    mode='max',
    noise_range_t=None,
    noise_type='normal',
    noise_pct=0.67,
    noise_std=1.0,
    noise_seed=None,
    initialize=True,
)

6.6. utils

The package contains some utility classes for networks such as ResNet and MobileNet, mainly in service of the network structures.

6.6.1. checkpoint_saver.py

class CheckpointSaver(
    model=model, optimizer=optimizer, args=args, model_ema=model_ema, amp_scaler=loss_scaler,
    checkpoint_dir=output_dir, recovery_dir=output_dir, decreasing=decreasing, max_history=args.checkpoint_hist)

Methods: save_checkpoint, _save, _cleanup_checkpoints, save_recovery, find_recovery

save_checkpoint(self, epoch, metric=None)
  1. Over all training epochs, at most 10 best sets of model parameters and results are kept, via checkpoint_files.append((save_file_path, metric)); whenever a better result appears, the worst existing one is replaced and the list is re-sorted (see the sketch below).
  2. Sorting is by metric, ascending or descending; in train.py the metric here is the average "loss", with reverse=not decreasing, i.e. decreasing=True, so in the end the 10 parameter sets with the smallest loss are kept and recorded in the log.
  3. A unified naming rule is used for saved files: 'checkpoint' + str(epoch) + '.pth.tar'.
  4. Returns (None, None) if self.best_metric is None, else (self.best_metric, self.best_epoch); it is not certain whether this refers to the best metric and the corresponding epoch among the 10 kept results.
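
A rough sketch of how train.py drives CheckpointSaver each epoch; the model, optimizer, args namespace, output path, and the dummy loss values are all stand-ins.

import argparse, os
import torch
import timm
from timm.utils import CheckpointSaver

model = timm.create_model('resnet18', num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
args = argparse.Namespace(model='resnet18')        # normally the full CLI namespace
output_dir = './output/train/example'
os.makedirs(output_dir, exist_ok=True)

eval_metric = 'loss'
saver = CheckpointSaver(
    model=model, optimizer=optimizer, args=args,
    checkpoint_dir=output_dir, recovery_dir=output_dir,
    decreasing=(eval_metric == 'loss'),            # sort so the smallest loss is best
    max_history=10,
)

for epoch in range(3):
    dummy_loss = 1.0 / (epoch + 1)                 # stand-in for eval_metrics[eval_metric]
    best_metric, best_epoch = saver.save_checkpoint(epoch, metric=dummy_loss)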
_save(self, save_path, epoch, metric=None)

Uses torch.save() and records:
epoch
model
args
get_state_dict(self.model, self.unwrap_fn)
optimizer.state_dict()
'version'
amp_scaler.state_dict()
get_state_dict(self.model_ema, self.unwrap_fn)
metric

_cleanup_checkpoints

Defines the rule for removing surplus checkpoints: after checkpoint_files is sorted, every entry with index >= max_history (10) is deleted.

save_recovery

Saves the model parameters and results after a given epoch and batch (a recovery checkpoint).

6.6.2. summary.py

  1. get_outdir(path, *paths, inc=False) sets output_dir:
    inc=True: the folder name is generated by counting up;
    inc=False: if the path path/*paths does not exist it is created, otherwise it is used directly;
    in train.py, *paths = experiment when experiment is not the default ''.

  2. update_summary(epoch, train_metrics, eval_metrics, filename, write_header=False, log_wandb=False) writes the train and validation results to a CSV; see the sketch below.
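
A minimal sketch of writing per-epoch metrics to a summary.csv the way train.py does; the metric values are dummies and the paths are examples.

import os
from timm.utils import get_outdir, update_summary

output_dir = get_outdir('./output/train', 'my-experiment')   # created if it does not exist
for epoch in range(2):
    train_metrics = {'loss': 1.0 / (epoch + 1)}
    eval_metrics = {'loss': 1.2 / (epoch + 1), 'top1': 50.0 + epoch, 'top5': 80.0 + epoch}
    update_summary(
        epoch, train_metrics, eval_metrics,
        filename=os.path.join(output_dir, 'summary.csv'),
        write_header=(epoch == 0),                           # header only for the first row
    )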

6.6.3. log.py

setup_default_logging(default_level=logging.INFO, log_path='')

6.6.4. metrics.py

Defines class AverageMeter and the accuracy function.

accuracy(output, target, topk=(1,))

Arguments:
output is the predicted probability matrix, of size batch_size * num(label);
target is the ground-truth label matrix, of size 1 * batch_size;
topk is usually set to (1, n), meaning at most the n largest predictions are taken; n > num(label) is not a problem, because maxk = min(max(topk), output.size()[1]) handles it. Inside the function, _, pred = output.topk(maxk, 1, True, True) actually takes at most the top maxk predictions, where "_" holds the predicted probabilities and pred holds the indices, i.e. the predicted labels.
Output:
The fraction of samples in this batch whose highest-probability prediction is the correct label, and the fraction of samples whose top maxk predictions contain the correct label.
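
A small sketch of accuracy() and AverageMeter on random data; the shapes follow the description above.

import torch
from timm.utils import AverageMeter, accuracy

output = torch.randn(8, 5)              # batch_size=8, 5 classes (prediction scores)
target = torch.randint(0, 5, (8,))      # ground-truth labels

top1, top5 = accuracy(output, target, topk=(1, 5))
top1_m = AverageMeter()
top1_m.update(top1.item(), n=output.size(0))
print(top1_m.avg)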

6.6.5. misc.py

natural_key(string_) provides the key used by sorted() for natural (human) ordering of file names.

7. avg_checkpoints.py

Averages all model weights matching a filter wildcard on the given path. For good results, those checkpoints must come from the same training run.

8. benchmark.py

A script for benchmarking the inference and training steps of timm models.

9. inference.py

An example inference script that writes the top-k class ids for the images in a folder to a CSV.

10. train.py

This is a lean, easily-modified ImageNet training script that can reproduce ImageNet training results with some of the latest networks and training techniques. It sticks to canonical PyTorch and standard Python style, while still offering many options for training speed and improved results that you are free to pick from.

Arguments

These arguments define Dataset/Model parameters, Optimizer parameters, Learning Rate scheduler parameters, Augmentation and regularization, Batch Norm parameters, Model exponential moving average parameters, and some miscellaneous parameters such as --seed, --tta, etc.
Do note that some random augmentations are enabled by default, such as color_jitter and hflip, but there is a parameter no-aug in case you want to turn off all training data augmentations. Also, the default optimizer opt is 'sgd', but it is possible to change that; timm offers a vast number of optimizers to train your models with.

Column 1 | Column 2
--aa | Auto-Augment

Run

Distributed Training on multiple GPUs
To train models on multiple GPUs, simply replace python train.py with ./distributed_train.sh like so:

./distributed_train.sh 4 ./imagenette2-320 --aug-splits 3 --jsd

This trains the model using AugMix data augmentation on 4 GPUs.

step and args

  1. loss function
    args.jsd_loss, args.aug_splits, args.smoothing
    args.mixup, args.cutmix, args.cutmix_minmax, args.bce_loss, args.bce_target_thresh
  2. distributed computation
    args.local_rank: the gpu_id, an int; it can only address one GPU, default 0.
    It has a second role: when args.local_rank == 0, the log is written and sometimes the train batch images are saved.
  3. train
  4. whether to use wandb: wandb must first be installed so that has_wandb = True; it also involves args.log_wandb, default False.
  5. augmentation: mixup, cutmix
  6. Attention
    When using mixed precision, train_one_epoch and validate raise:
Error: Input type (torch.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same

This is because the input was not moved to the GPU. If args.prefetcher = False, you need to change

if not args.prefetcher:
to
if args.prefetcher:

11. validate.py

A lean and easily-modified ImageNet validation script, similar in function to train.py.

11.1. Arguments

name | annotation | default or example | relation
valid-labels | path of a .txt file containing the indices of the labels (some or all) to validate, one per line; the content is assigned to the variable valid_labels and applied as output = model(input); output = output[:, valid_labels] (see the sketch below) | default='' |
real-labels | path; Real labels JSON file for imagenet evaluation | default='' |
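
A small sketch of the valid-labels filtering described above; valid_labels.txt is a hypothetical example file with one class index per line, and the model output is faked with random numbers.

import torch

with open('valid_labels.txt') as f:
    valid_labels = [int(line.strip()) for line in f if line.strip()]

output = torch.randn(4, 1000)           # stand-in for output = model(input)
output = output[:, valid_labels]        # keep only the requested class columns
print(output.shape)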

notice

To use from dataset_factory_joyce import create_dataset_joyce, you must set create_loader's use_prefetcher = args.no_prefetcher = False.

11.2. debug

In validate.py, from timm.utils import set_jit_fuser raises an import error; changing it to from timm.utils import * lets it run.

Q

  1. With distributed = True, the device is still just a single cuda:local_rank; how is that distributed?
  2. wandb was not actually put to use (solved)
  3. What is cuDNN?