Image Classification: A Walkthrough of the pytorch-image-models-master Code Directory

Latest recommended article published 2024-09-13 21:43:30

wwzl0980

Article tags: pytorch, classification, deep learning

Article link: https://blog.csdn.net/wwzl0980/article/details/123865304

References
Terminology
File descriptions
Parameters
1. convert
2. docs
3. notebooks
4. results
5. test
6. timm
- 6.1. data
- 6.2. loss
- 6.3. models
- - /layers/test_time_pool.py
- 6.4. optim
- - optim_factory.py
- 6.5. scheduler
- - Args of PlateauLRScheduler in timm compared with ReduceLROnPlateau in PyTorch
- 6.6. utils
7. avg_checkpoints.py
8 benchmark.py
9 inference.py
10 train.py
- Parameters
- Running
- step and args
11 validate.py
- 11.1. Parameters
- notice
11.2. debug
Q

References

https://fastai.github.io/timmdocs/training_modelEMA
https://blog.csdn.net/weixin_44396553/article/details/120901765
https://github.com/rwightman/pytorch-image-models
https://fastai.github.io/timmdocs/training
broken pipe https://blog.csdn.net/qq_26369907/article/details/99701006
https://www.cnblogs.com/jiangkejie/p/14965003.html
Pytorch Image Models (a.k.a. timm) has a lot of pretrained models and interface which allows using these models as encoders in smp, however, not all models are supported.
transformer models do not have features_only functionality implemented
some models do not have appropriate strides

https://zhuanlan.zhihu.com/p/469323798

Terminology

Abbreviation	Full name	Explanation
AdvProp	Adversarial Propagation	AdvProp is an adversarial training scheme which treats adversarial examples as additional examples, to prevent overfitting. A method that learns from both clean images as well as adversarially modified images. Since clean images are derived from a different distribution as compared to adversarial images, the model needs to also use the batch statistics according to image source. If not, the model may not be effective in extracting accurate features. https://paperswithcode.com/paper/adversarial-examples-improve-image#code; https://resbyte.github.io/posts/2020/06/Adversarial-Prop/?msclkid=ca22832fb15511ecbd8c5c507c8acfab
BatchNorm(BN)	Batch normalization	Batch normalization is a technique for training very deep neural networks that standardizes the inputs to a layer for each mini-batch. It can be implemented during training by calculating the mean and standard deviation of each input variable to a layer per mini-batch and using these statistics to perform the standardization. Alternately, a running average of mean and standard deviation can be maintained across mini-batches, but may result in unstable training. https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/?msclkid=40fc3fceb15711ecab625b99bab242c6 https://keras.io/api/layers/normalization_layers/batch_normalization/?msclkid=40fd1d9bb15711eca054521d10053fb0
AugMix		AugMix is a data processing technique that mixes augmented images and enforces consistent embeddings of the augmented images, which results in increased robustness and improved uncertainty calibration. timm also supports augmix with RandAugment and AutoAugment. https://github.com/google-research/augmix?msclkid=68e4d7b9b18d11eca899bc519f05d2e6
SyncBN	Synchronized Batch Normalization	SyncBN is a type of batch normalization used for multi-GPU training. Standard batch normalization only normalizes the data within each device (GPU). SyncBN normalizes the input within the whole mini-batch.
SGDR	Stochastic Gradient Descent with Warm Restarts

File descriptions

Parameters

name	annotation	default or example	relation
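data_dir	Path where the custom dataset is stored; used by create_dataset	e.g. 'd:/image' (containing the subfolders 'd:/image/train' and 'd:/image/valid')	create_dataset
dataset	Used by create_dataset; set to 'Afolder/Bfolder', which defines how the data is fetched or which data is fetched	'torch/image_folder'	create_dataset
train-split	Used by create_dataset; passed as the split argument to mark the dataset as the training set	'train'	create_dataset
val-split	Used by create_dataset; passed as the split argument to mark the dataset as the validation/test set	'validation'	create_dataset
class-map	Not needed when dataset = 'torch/…' or 'tfds/…'	Class indices given through a text file or a dict	create_dataset
dataset_download	Used by create_dataset; needed when one of the public datasets available through PyTorch has to be downloaded	False	create_dataset
input_size	Image size; user input such as '1', '224', '224' is converted to a list or tuple, e.g. [3, 224, 224]	None	create_loader
img_size	Image size, int. In resolve_data_config (/timm/data/config.py), input_size is taken first from the user-defined input_size, then from input_size = (3, args['img_size'], args['img_size']), and finally from the input_size in the model's default_cfg	None	create_loader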
batch_size	train batch size	128	create_loader
validation-batch-size	Validation batch size	default=None	create_loader
crop_pct	float, input image center crop percent (for validation only); resolve_data_config falls back to the value defined in the model config	default=None	create_loader
mean	Override mean pixel value of dataset; used by create_loader in loader.py; resolve_data_config falls back to the value defined in the model config	default=None	create_loader
std	Override std deviation of dataset; used by create_loader in loader.py; resolve_data_config falls back to the value defined in the model config	default=None	create_loader
interpolation	Image resize interpolation type (overrides model); resolve_data_config falls back to the value defined in the model config	default=''	create_loader
train-interpolation	Training interpolation (random, bilinear, bicubic; default: "random"); args.interpolation is sometimes used instead	'random'	create_loader
aug-splits	Number of augmentation splits (default: 0, valid: 0 or >=2); assigned to num_aug_splits in train.py. For example, with 8 original images per batch and num_aug_splits=2, loader_train yields the 8 original images followed by 8 images that represent augmix1. With num_aug_splits=3 the effective batch_size would be 24: the first 8 images are the originals, the next 8 represent augmix1 and the last 8 represent augmix2.	default=0	loss augmentation
local_rank	Mainly used for writing the log	default=0
use-multi-epochs-loader	use the /timm/data/loader.py/MultiEpochsDataLoader to save time at the beginning of every epoch	default=False
log-wandb	log training and validation metrics to wandb	default=False
no-prefetcher	In train.py, prefetcher = not no-prefetcher	default=False; setting it disables the fast prefetcher
no_aug	when True, disable all training augmentation	False	create_loader
reprob	Random erase prob; in create_loader (loader.py) it is used as reprob if is_training and not no_aug else 0.	default=0
remode	Random erase mode; used by create_loader in loader.py	default='pixel'
recount	Random erase count; used by create_loader in loader.py	default=1
resplit	Do not random erase first (clean) augmentation split; used by create_loader in loader.py	default=False
mixup	mixup alpha, mixup enabled if > 0	default=0.0	loss augmentation
cutmix	cutmix alpha, cutmix enabled if > 0	default=0.0	loss augmentation
cutmix-minmax	cutmix min/max ratio, overrides alpha and enables cutmix if set	default=None	loss augmentation
smoothing	Label smoothing	default=0.1	loss augmentation
jsd-loss	Enable Jensen-Shannon Divergence + CE loss. Use with aug-splits	default=False	loss
bce-loss	Enable BCE loss w/ Mixup/CutMix use	default=False	loss
bce-target-thresh	Threshold for binarizing softened BCE targets (default: None, disabled)	default=None	loss
mixup-off-epoch	Turn off mixup after this epoch, disabled if 0	default=0	train
log_interval	How many batches to wait before logging training status. Must not be 0; every log_interval batches a log line is written and the metric averages are computed	default=50
save_images	save images of input batches every log interval for debugging	default=False
recovery_interval	how many batches to wait before writing recovery checkpoint	default=0
amp	use NVIDIA Apex AMP or Native AMP for mixed precision training	default=False	mixed precision training
apex-amp	Use NVIDIA Apex AMP mixed precision	default=False	mixed precision training
native-amp	Use Native Torch AMP mixed precision	default=False	mixed precision training
checkpoint-hist	number of checkpoints to keep	default=10	saver
eval-metric	Key of one of the evaluation metrics in the output of the validate function. decreasing = True if eval_metric == 'loss' else False; decreasing is used by the saver to choose the sort order, so the chosen metric must be present in the validation output	default='top1'	saver
experiment	1. Project name created when using wandb 2. Name of the sub-folder of the training output path	default=''	wandb
model-ema	Enable tracking moving average of model weights (whether to use EMA)	default=False	ema
model-ema-force-cpu	Force ema to be tracked on CPU, rank=0 node only. Disables EMA validation.	default=False	ema
model-ema-decay	decay factor for model weights moving average (default: 0.9998)	default=0.9998	ema
output	path to output folder (default: none, current dir). Related to the variable output_dir (= output or './output/train', + time + experiment + img.weight)	default=''
channels-last	Use channels_last memory layout	default=False	model
model	Name of model to train	default=‘resnet50’	model
pretrained	Start with pretrained version of specified network (if avail); corresponds to pretrained in create_model	default=False	model
initial-checkpoint	Full file path. Initialize model from this checkpoint; corresponds to checkpoint_path in create_model	default=''	model
num-classes	number of label classes. Model must have `num_classes` attr if not set on cmd line/config.	default=None	model
epochs	int, number of epochs to train	300	train
epoch-repeats	float, epoch repeat multiplier (number of times to repeat dataset epoch per train epoch)	0.	train
start-epoch	int, manual epoch number (useful on restarts). If None, it is set to 0 or to the epoch stored by resume + 1	default=None	resuming training
resume	Full file path. Resume full model and optimizer state from checkpoint (helpers.py/resume_checkpoint); similar in some ways to initial-checkpoint	default=''	resuming training
no-resume-opt	prevent resume of optimizer state when resuming model	False	resuming training
torchscript	convert model torchscript for inference		model
drop	Dropout rate	default=0.0	model
gp	Global pool type, one of (fast, avg, max, avgmax, avgmaxc). Model default if None.	default=None	model
opt	Optimizer	default=‘sgd’	opt
opt-eps	float, Optimizer Epsilon	default=None	opt
opt-betas	float, Optimizer Betas	default=None	opt
lr	learning rate	default=0.05	opt Learning rate schedule
momentum	Optimizer momentum	default=0.9	opt
weight-decay	weight decay	default=2e-5	opt
clip-grad	Clip gradient norm (default: None, no clipping)	default=None	opt
clip-mode	Gradient clipping mode. One of (“norm”, “value”, “agc”)	default=‘norm’	opt
sched	str, LR scheduler (learning rate decay schedule)	'cosine'	create_scheduler
decay-rate	float, LR decay rate	0.1	create_scheduler
warmup-lr	float, warmup learning rate; the LR first rises from this value and then starts to decay	0.0001	create_scheduler
min-lr	float, lower lr bound for cyclic schedulers that hit 0 (1e-5)	1e-6	create_scheduler
decay-epochs	float, epoch interval to decay LR	100	create_scheduler
warmup-epochs	int, epochs to warmup LR, if scheduler supports	3	create_scheduler
cooldown-epochs	int, epochs to cooldown LR at min_lr, after cyclic schedule ends	10	create_scheduler
patience-epochs	int, patience epochs for Plateau LR scheduler; if the tracked metric (e.g. the loss) does not improve for more than this many epochs, the LR is reduced	10	create_scheduler
lr-noise	float, learning rate noise on/off epoch percentages	default=None	create_scheduler
lr-noise-pct	float, learning rate noise limit percent	0.67	create_scheduler
lr-noise-std	float, learning rate noise std-dev	1.0	create_scheduler
lr-cycle-mul	float, learning rate cycle len multiplier	1.0	create_scheduler
lr-cycle-decay	float, amount to decay each learning rate cycle	0.5	create_scheduler
lr-cycle-limit	int, learning rate cycle limit, cycles enabled if > 1	1	create_scheduler
lr-k-decay	float, learning rate k-decay for cosine/poly	1.0	create_scheduler
tta	Test/inference time augmentation (oversampling) factor. 0=None (default: 0)	default=0
checkpoint_hist	type=int, number of checkpoints to keep; after all epochs have been trained, how many sets of model weights and training results are finally kept	default=10
Variables that have to be added manually
distributed	Whether distributed training is enabled, i.e. more than one process is used	default=False	distribute
world_size	when distributed, = torch.distributed.get_world_size(), means total processes, 1 GPU per process.	default=1	distribute
rank	when distributed, = torch.distributed.get_rank()	default=0	distribute
hflip	Training set: transforms.RandomHorizontalFlip	0.5	create_loader
vflip	Training set: transforms.RandomVerticalFlip	0	create_loader
color_jitter	Training set: transforms.ColorJitter	0.4	create_loader
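
The model-ema-decay entry in the table above controls an exponential moving average of the model weights. As a rough illustration only, the update rule can be sketched with plain Python floats standing in for tensors. The helper `ema_update` is a hypothetical name and this is not timm's own EMA class.

```python
def ema_update(ema_weights, model_weights, decay=0.9998):
    """Return new EMA weights: ema = decay * ema + (1 - decay) * current."""
    return {
        name: decay * ema_weights[name] + (1.0 - decay) * model_weights[name]
        for name in ema_weights
    }


ema = {"w": 1.0}
for step in range(3):
    current = {"w": 0.0}  # pretend the trained weight has moved to 0
    ema = ema_update(ema, current, decay=0.5)  # small decay so the effect is visible
print(ema)
```

With a decay close to 1, such as the default 0.9998, the averaged weights follow the trained weights much more slowly than in this toy run.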

1. convert

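This folder mainly contains mxnet and flax scripts, which convert pretrained mxnet and nest_flax models into PyTorch models; converting pretrained models between deep learning frameworks is a common task.

2. docs

This folder mainly holds .md documentation files, covering each model's parameters, usage steps, requirements, code origin, and Top-1 and Top-5 accuracy on ImageNet. If you need a particular model, look it up in the docs folder.
2.1. index.md: getting started with timm
2.2. feature_extraction.md
2.3. models.md: reference papers for the various models
2.4. results.md: accuracy of the various models on ImageNet
2.5. scripts.md and training_hparam_examples.md: examples of calling the scripts with particular argument settings
2.6. docs: code examples showing how to load and use each individual model

3. notebooks

Code, notes and Jupyter notebooks from the author's model reproductions

4. results

This folder mainly holds result files

5. test

This folder contains test programs for layers, models, optimizers and utility classes. Readers can use them to test their own models as needed.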

6. timm

6.1. data

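This package mainly covers image-related settings and data loading: data can be read from a path, a tar archive and so on. It also contains image preprocessing such as auto augmentation and the transforms pipeline, which help the trained network generalize and recognize better. Note that transforms here means image preprocessing transforms and is unrelated to the Transformer architecture, which originated in NLP, relies on self-attention, and has since become very popular for image tasks as well.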

6.1.1. auto_augment.py

Mainly relies on PIL

6.1.2. config.py

resolve_data_config(args, default_cfg={}, model=None, use_test_size=False, verbose=False)
Builds new_config = {}, containing the input/image size (default (3, 224, 224)), interpolation method, mean and std deviation for normalization, and crop percentage (the center-crop ratio for validation images)
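
The priority order for the input size (user input_size, then img_size, then the model's default_cfg) can be sketched as follows. This is a simplified illustration with plain dicts; `resolve_input_size` is a hypothetical helper, and the real resolve_data_config handles more cases.

```python
def resolve_input_size(args, default_cfg, in_chans=3):
    """Pick input_size by priority: user input_size > user img_size > model default."""
    if args.get("input_size") is not None:
        return tuple(args["input_size"])
    if args.get("img_size") is not None:
        return (in_chans, args["img_size"], args["img_size"])
    # fall back to the model config, then to the generic (3, 224, 224)
    return tuple(default_cfg.get("input_size", (in_chans, 224, 224)))


print(resolve_input_size({"input_size": [1, 224, 224]}, {}))
print(resolve_input_size({"img_size": 320}, {"input_size": (3, 288, 288)}))
print(resolve_input_size({}, {"input_size": (3, 288, 288)}))
```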

6.1.3. create_dataset in dataset_factory.py

Parameters
name,
root,
split='validation',
search_split=True,
class_map=None,
load_bytes=False,
is_training=False,
download=False,
batch_size=None,
repeats=0,
**kwargs

name is set to 'Afolder/Bfolder' and defines how the data is fetched or which data is fetched.

When name.lower().startswith('torch/'):
1.1 If Bfolder is one of CIFAR100, CIFAR10, MNIST, QMNIST, KMNIST, FashionMNIST (from torchvision.datasets), the corresponding public dataset is fetched.
1.2 If Bfolder is imagenet and split is additionally val (validation or test), ImageNet is fetched.
1.3 If Bfolder is image_folder or folder, torchvision.datasets.ImageFolder is used to fetch the dataset. root (the path) is also required and must be a directory (os.path.isdir(root) is checked). It does not need to point at the split-specific level: when split names the purpose and search_split=True, 'root/train' or 'root/training' is looked up automatically, and likewise for the evaluation split.
name.lower().startswith('tfds/')
Otherwise, ImageDataset from timm/data/dataset.py is used to fetch the data; this is not covered here for now.

split defines the purpose of the dataset. There are two kinds:
_TRAIN_SYNONYM = {'train', 'training'}
_EVAL_SYNONYM = {'val', 'valid', 'validation', 'eval', 'evaluation'},
and split takes one of these values.

So for the current task the following settings are needed:
name='torch/image_folder',
root,
split='validation',
search_split=True,
is_training=False
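
The search_split behaviour described above (look for a sub-folder that matches the split name or one of its synonyms) can be sketched as follows. This is a minimal illustration: `find_split_dir` is a hypothetical helper, and ordered tuples are used here instead of the sets in timm so the lookup order is deterministic.

```python
import os
import tempfile

TRAIN_SYNONYMS = ("train", "training")
EVAL_SYNONYMS = ("val", "valid", "validation", "eval", "evaluation")


def find_split_dir(root, split):
    """Return root/<split or one of its synonyms> if that folder exists, else root."""
    direct = os.path.join(root, split)
    if os.path.isdir(direct):
        return direct
    for group in (TRAIN_SYNONYMS, EVAL_SYNONYMS):
        if split in group:
            for name in group:
                candidate = os.path.join(root, name)
                if os.path.isdir(candidate):
                    return candidate
    return root


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "train"))
    os.makedirs(os.path.join(root, "valid"))
    found_val = os.path.basename(find_split_dir(root, "validation"))
    found_train = os.path.basename(find_split_dir(root, "training"))
print(found_val, found_train)
```

In this toy layout, split='validation' resolves to the existing 'valid' folder, which is why root only has to point at the parent directory.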

6.1.4. create_loader in loader.py

The base loader is torch.utils.data.DataLoader and the prefetcher is used by default. When use_multi_epochs_loader = True, MultiEpochsDataLoader is used instead; when prefetcher = True, the loader is wrapped in
PrefetchLoader(loader,
mean=mean,
std=std,
channels=input_size[0],
fp16=fp16,
re_prob=prefetch_re_prob,
re_mode=re_mode,
re_count=re_count,
re_num_splits=re_num_splits,  # related to resplit and num_aug_splits
)
Attention
The num_workers argument of torch.utils.data.DataLoader easily triggers the error errno 32 broken pipe on Windows 10. Worker processes are started differently on Windows than on Linux (spawn rather than fork), and the error typically shows up when multi-process loading is used there; setting num_workers to 0 avoids it.

train_loader = torch.utils.data.DataLoader( trainData, batch_size=40, shuffle=True, 
    num_workers=0, # set num_workers to 0 here
)

However, another error then appears:

ValueError: persistent_workers option needs num_workers > 0

This is because create_loader() defaults to persistent_workers=True. persistent_workers is also an argument of torch.utils.data.DataLoader; its benefit is that worker processes need not be shut down and restarted between epochs, which speeds up training, but persistent_workers=True conflicts with num_workers=0. So call create_loader(persistent_workers=False).

persistent_workers (bool, optional) – If True, the data loader will not shutdown the worker processes after a dataset has been consumed once. This allows to maintain the workers Dataset instances alive. (default: False)
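
One way to keep the two settings consistent is to derive persistent_workers from num_workers. The sketch below only builds the keyword arguments; `loader_kwargs` is a hypothetical helper, and the resulting dict is assumed to be passed on as torch.utils.data.DataLoader(dataset, **kwargs).

```python
def loader_kwargs(batch_size, num_workers):
    """Build DataLoader keyword arguments that cannot trigger the ValueError."""
    return {
        "batch_size": batch_size,
        "num_workers": num_workers,
        # persistent_workers=True is only valid when worker processes exist
        "persistent_workers": num_workers > 0,
    }


print(loader_kwargs(batch_size=40, num_workers=0))
print(loader_kwargs(batch_size=40, num_workers=4))
```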

6.1.5. real_labels.py

class RealLabelsImagenet, func add_result(self, output), get_accuracy(self, k=None)

6.1.6. create_transform in transforms_factory.py

If is_training=True and no_aug=True: resize + center crop
If is_training=True and no_aug=False: by default the auto_augment, hflip and color_jitter parameters drive the processing

6.2. loss

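This package includes an asymmetric loss for multi-label classification, which addresses the imbalance between positive and negative samples and the problem of mislabeled samples in multi-label tasks. The method is efficient and easy to use; compared with other recent methods it builds on mainstream network architectures and needs no extra information. Other loss functions are available as well, such as binary_cross_entropy, cross_entropy, jsd and so on. train.py uses LabelSmoothingCrossEntropy by default.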
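
To make the default loss concrete, here is a small pure-Python sketch of cross-entropy with uniform label smoothing for a single sample. `label_smoothing_ce` is a hypothetical helper working on plain lists; it illustrates the formula and is not timm's tensor implementation.

```python
import math


def label_smoothing_ce(logits, target, smoothing=0.1):
    """Cross-entropy for one sample with uniform label smoothing."""
    max_logit = max(logits)
    # numerically stable log-sum-exp
    log_sum = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    log_probs = [x - log_sum for x in logits]
    nll = -log_probs[target]                   # loss against the true class
    smooth = -sum(log_probs) / len(log_probs)  # loss against a uniform target
    return (1.0 - smoothing) * nll + smoothing * smooth


print(label_smoothing_ce([2.0, 0.5, -1.0], target=0, smoothing=0.1))
print(label_smoothing_ce([2.0, 0.5, -1.0], target=0, smoothing=0.0))
```

With smoothing=0.0 the function reduces to ordinary cross-entropy; a positive smoothing penalizes over-confident predictions on the true class.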

6.3. models

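Functions for loading models, similar to torchvision.models

models/factory.py/create_model

Parameter	annotation
model_name (str)	Name of the model to instantiate
pretrained (bool)	If True, load pretrained ImageNet-1k weights
checkpoint_path (str)	Path of a checkpoint to load after the model is initialized. When checkpoint_path is non-empty, models/helpers.py/load_checkpoint(model, checkpoint_path, use_ema=False, strict=True) is called: it reads the saved weights with state_dict = load_state_dict(checkpoint_path, use_ema) and loads them into the model with model.load_state_dict(state_dict, strict=strict). This is independent of pretrained
scriptable (bool)	Set layer config so that the model is jit scriptable (not working for all models yet)
exportable (bool)	Set layer config so that the model is traceable / ONNX exportable (not fully impl/obeyed yet)
no_jit (bool)	Set layer config so that the model does not use jit scripted layers (so far activations only)
drop_rate (float)	Dropout rate for training (default: 0.0)
global_pool (str)	Global pool type (default: 'avg')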

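models/helpers.py,
2.1. Resuming training: resume_checkpoint(model, checkpoint_path, optimizer=None, loss_scaler=None, log_info=True). When it is used from train.py, lr_scheduler.step(start_epoch) on line 498 raises an error saying that "metric" must not be None; this can be worked around by modifying resume_checkpoint so that it also returns the metric.
2.2. Initializing the model with self-trained weights and training again: load_checkpoint(model, checkpoint_path, use_ema=False, strict=True). Inside it, helpers.py/load_state_dict(checkpoint_path, use_ema=False) reads the model weights saved at checkpoint_path, and model.load_state_dict loads them into the model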
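
The step "read the saved weights, then hand them to the model" can be illustrated with plain dictionaries. The sketch below assumes a checkpoint dict with the keys 'state_dict' and 'state_dict_ema' and strips a 'module.' prefix left by data-parallel wrappers. `select_state_dict` is a hypothetical helper, not the library function.

```python
def select_state_dict(checkpoint, use_ema=False):
    """Pick the weights entry from a checkpoint dict and strip 'module.' prefixes."""
    if use_ema and checkpoint.get("state_dict_ema") is not None:
        state_dict = checkpoint["state_dict_ema"]
    elif "state_dict" in checkpoint:
        state_dict = checkpoint["state_dict"]
    else:
        state_dict = checkpoint  # the file holds the bare weights
    # parallel wrappers save keys as 'module.<name>'
    return {
        (key[len("module."):] if key.startswith("module.") else key): value
        for key, value in state_dict.items()
    }


ckpt = {"epoch": 3, "state_dict": {"module.fc.weight": [0.1], "fc.bias": [0.0]}}
print(select_state_dict(ckpt))
```

The returned mapping is what would then be passed to model.load_state_dict(state_dict, strict=strict).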

/layers/test_time_pool.py

apply_test_time_pool(model, config, use_test_size=True) -> model, test_time_pool(True, false)

6.4. optim

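This package contains the available optimizers. An optimizer updates the model weights from the gradients of the loss, and the learning rate lr scales the size of each update step.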

optim_factory.py

create_optimizer_v2(
model_or_params,
opt: str = 'sgd',
lr: Optional[float] = None,
weight_decay: float = 0.,
momentum: float = 0.9,
filter_bias_and_bn: bool = True,
layer_decay: Optional[float] = None,
param_group_fn: Optional[Callable] = None,
**kwargs):
Create an optimizer.
TODO currently the model is passed in and all parameters are selected for optimization.
For more general use an interface that allows selection of parameters to optimize and lr groups, one of:
* a filter fn interface that further breaks params into groups in a weight_decay compatible fashion
* expose the parameters interface and leave it up to caller
Args:
model_or_params (nn.Module): model containing parameters to optimize
opt: name of optimizer to create
lr: initial learning rate
weight_decay: weight decay to apply in optimizer
momentum: momentum for momentum based optimizers (others may use betas via kwargs)
filter_bias_and_bn: filter out bias, bn and other 1d params from weight decay
**kwargs: extra optimizer specific kwargs to pass through
Returns:
Optimizer
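
The effect of filter_bias_and_bn can be sketched without any tensors: parameters that are 1-d or are biases go into a group with zero weight decay. The function below is an illustrative stand-in that works on (name, shape) pairs; `split_weight_decay` and the example parameter names are assumptions for the sketch.

```python
def split_weight_decay(named_shapes, weight_decay=2e-5):
    """Group parameter names: 1-d tensors and biases get no weight decay."""
    decay, no_decay = [], []
    for name, shape in named_shapes:
        if len(shape) == 1 or name.endswith(".bias"):
            no_decay.append(name)
        else:
            decay.append(name)
    return [
        {"params": no_decay, "weight_decay": 0.0},
        {"params": decay, "weight_decay": weight_decay},
    ]


params = [
    ("conv1.weight", (64, 3, 7, 7)),
    ("bn1.weight", (64,)),
    ("bn1.bias", (64,)),
    ("fc.weight", (1000, 512)),
    ("fc.bias", (1000,)),
]
groups = split_weight_decay(params)
print(groups)
```

In a real setup the groups would hold the parameter tensors themselves and be passed to the optimizer constructor as parameter groups.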

6.5. scheduler

This package mainly handles the learning rate schedule. Unlike the builtin PyTorch schedulers, this is intended to be consistently called at the END of each epoch, before incrementing the epoch count, to calculate next epoch's value & at the END of each optimizer update, after incrementing the update count, to calculate next update's value. Accordingly, during training lr_scheduler.step_update(num_updates=num_updates, metric=losses_m.avg) is called after every batch and lr_scheduler.step(epoch + 1, eval_metrics[eval_metric]) after every epoch, which also explains the respective roles of lr_scheduler.step_update and lr_scheduler.step.

scheduler_factory.py defines lr_scheduler, num_epochs = create_scheduler(args, optimizer); the remaining files implement the following learning rate schedules

lr_scheduler	num_epochs	Underlying idea	Usage
from .cosine_lr import CosineLRScheduler	num_epochs = lr_scheduler.get_cycle_length() + args.cooldown_epochs	the SGDR scheduler also referred to as the cosine scheduler in timm
from .multistep_lr import MultiStepLRScheduler	num_epochs = args.epochs
from .plateau_lr import PlateauLRScheduler	num_epochs = args.epochs	This scheduler is very similar to PyTorch's ReduceLROnPlateau scheduler. The basic idea is to track an eval metric and based on the evaluation metric's value, the lr is reduced using StepLR if the eval metric is stagnant for a certain number of epochs.	Decay the LR by a factor every time the validation loss plateaus. The PlateauLRScheduler by default tracks the eval-metric which is by default top-1 in the timm training script. If the performance plateaus, then the new learning rate after a certain number of epochs (by default 10) is set to lr * decay_rate. This scheduler underneath uses PyTorch's ReduceLROnPlateau.
from .poly_lr import PolyLRScheduler	num_epochs = lr_scheduler.get_cycle_length() + args.cooldown_epochs
from .step_lr import StepLRScheduler	num_epochs = args.epochs	The StepLR is a basic step LR schedule with warmup, noise. PyTorch’s implementation does not support warmup or noise. After a certain number decay_epochs, the learning rate is updated to be lr * decay_rate.
from .tanh_lr import TanhLRScheduler	num_epochs = lr_scheduler.get_cycle_length() + args.cooldown_epochs	Stochastic Gradient Descent with Hyperbolic-Tangent Decay on Classification. This is also referred to as the tanh annealing. tanh stands for hyperbolic tangent decay.

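When lr_cycle_limit = 1 and lr_cycle_mul = 1, lr_scheduler.get_cycle_length() = args.epochs.
All timm LR schedulers inherit from timm's own Scheduler base class (timm/scheduler/scheduler.py), not from a PyTorch scheduler class, and thereby get the methods Scheduler.step and Scheduler.step_update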
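
To give a feel for the default 'cosine' schedule with warmup, the following sketch computes the learning rate for a given epoch in plain Python. It uses the default values from the parameter table (lr, min-lr, warmup-lr, warmup-epochs, epochs) and deliberately ignores cycles, noise and cooldown; `cosine_lr` is a hypothetical function, not the CosineLRScheduler class.

```python
import math


def cosine_lr(epoch, base_lr=0.05, total_epochs=300, lr_min=1e-6,
              warmup_epochs=3, warmup_lr=1e-4):
    """Learning rate for a given epoch: linear warmup, then cosine decay."""
    if epoch < warmup_epochs:
        step = (base_lr - warmup_lr) / warmup_epochs
        return warmup_lr + epoch * step
    progress = min(epoch, total_epochs) / total_epochs
    return lr_min + 0.5 * (base_lr - lr_min) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 3, 150, 300):
    print(epoch, cosine_lr(epoch))
```

Calling such a function with epoch + 1 at the end of an epoch mirrors the convention quoted above: the value for the next epoch is computed before the epoch counter is incremented.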

Args of PlateauLRScheduler in timm compared with ReduceLROnPlateau in PyTorch

TIMM	PyTorch
patience_t	patience
decay_rate	factor
verbose	verbose
threshold	threshold
cooldown_t	cooldown
mode	mode
lr_min	min_lr

When args.sched == 'plateau', mode is "min" only if eval_metric = "loss"; otherwise mode = "max". PlateauLRScheduler itself also defaults to "max"

PlateauLRScheduler(
                  optimizer,
                 decay_rate=0.1,
                 patience_t=10,
                 verbose=True,
                 threshold=1e-4,
                 cooldown_t=0,
                 warmup_t=0,
                 warmup_lr_init=0,
                 lr_min=0,
                 mode='max',
                 noise_range_t=None,
                 noise_type='normal',
                 noise_pct=0.67,
                 noise_std=1.0,
                 noise_seed=None,
                 initialize=True,
)
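
The core of the plateau logic (track the best metric, count epochs without improvement, cut the LR once patience is exceeded) can be sketched in a few lines. The class below is a simplified stand-in that reuses the argument names decay_rate, patience_t, mode and lr_min; it ignores threshold, cooldown, warmup and noise, and `SimplePlateau` is a hypothetical name.

```python
class SimplePlateau:
    """Cut lr by decay_rate when the metric has not improved for > patience_t epochs."""

    def __init__(self, lr, decay_rate=0.1, patience_t=10, mode="max", lr_min=0.0):
        self.lr, self.decay_rate, self.patience_t = lr, decay_rate, patience_t
        self.mode, self.lr_min = mode, lr_min
        self.best, self.bad_epochs = None, 0

    def _improved(self, metric):
        if self.best is None:
            return True
        return metric > self.best if self.mode == "max" else metric < self.best

    def step(self, metric):
        if self._improved(metric):
            self.best, self.bad_epochs = metric, 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience_t:
            self.lr = max(self.lr * self.decay_rate, self.lr_min)
            self.bad_epochs = 0
        return self.lr


sched = SimplePlateau(lr=0.1, patience_t=2, mode="max")
history = [sched.step(top1) for top1 in (70.0, 70.0, 70.0, 70.0)]
print(history)
```

With mode="max" a metric such as top1 has to increase to count as an improvement; with mode="min" (used when eval_metric is 'loss') it has to decrease.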

6.6. utils

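This package contains utility helpers that support training and validating the networks, such as checkpoint saving, summaries, logging and metrics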

6.6.1. checkpoint_saver.py

class CheckpointSaver(
model=model, optimizer=optimizer, args=args, model_ema=model_ema, amp_scaler=loss_scaler,
checkpoint_dir=output_dir, recovery_dir=output_dir, decreasing=decreasing, max_history=args.checkpoint_hist),

func save_checkpoint, _save, _cleanup_checkpoints, save_recovery, find_recovery

save_checkpoint(self, epoch, metric=None)

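Over all training epochs, at most 10 sets of the best model weights and results end up being saved. Entries are added with checkpoint_files.append((save_file_path, metric)); when a better result appears it replaces the worst existing one, and the list is kept sorted.
Sorting is ascending or descending by metric, with reverse=not decreasing. When the metric is the average "loss", decreasing=True, so the 10 checkpoints with the smallest loss are kept and recorded in the log.
The file naming is unified as 'checkpoint-' + str(epoch) + '.pth.tar'
return (None, None) if self.best_metric is None else (self.best_metric, self.best_epoch); it is not certain whether this refers to the best metric among the 10 kept results and its epoch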

_save(self, save_path, epoch, metric=None)

Uses torch.save() and records
epoch
model
args
get_state_dict(self.model, self.unwrap_fn),
optimizer.state_dict(),
‘version’
amp_scaler.state_dict()
get_state_dict(self.model_ema, self.unwrap_fn)
metric

_cleanup_checkpoints

Defines the rule for deleting surplus checkpoints: after checkpoint_files is sorted, every entry with index >= 10 is deleted
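
The sort-and-trim behaviour of the saver can be illustrated with a list of (path, metric) tuples. This is a hedged sketch of the idea rather than the CheckpointSaver code: `keep_best` is a hypothetical helper, the file names are illustrative, and no files are written or deleted.

```python
def keep_best(checkpoints, new_entry, max_history=10, decreasing=False):
    """Add (path, metric), sort best-first, and split into kept and removed entries."""
    ranked = sorted(checkpoints + [new_entry], key=lambda item: item[1],
                    reverse=not decreasing)
    return ranked[:max_history], ranked[max_history:]


kept = []
removed = []
for epoch, top1 in enumerate([71.2, 69.8, 73.5, 72.0]):
    entry = ("checkpoint-{}.pth.tar".format(epoch), top1)
    kept, removed = keep_best(kept, entry, max_history=3, decreasing=False)
print(kept)
print(removed)
```

With decreasing=False a higher metric is better (for example top1); with decreasing=True the entries with the lowest metric (for example loss) are kept.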

save_recovery

Saves the model weights and results after a given batch of a given epoch

6.6.2. summary.py

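get_outdir(path, *paths, inc=False) sets output_dir:
with inc=True, the folder name is made unique by a counter;
with inc=False, the path built from path and *paths is created if it does not exist, otherwise it is used directly;
in train.py, *paths = experiment if experiment is not the default ''.
update_summary(epoch, train_metrics, eval_metrics, filename, write_header=False, log_wandb=False) writes the train and validation results to a CSV file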
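
Writing one row of train and validation metrics per epoch to a CSV file can be sketched with the standard csv module. `append_summary` is a hypothetical helper that mirrors the arguments listed above, without the wandb logging.

```python
import csv
import os
import tempfile


def append_summary(epoch, train_metrics, eval_metrics, filename, write_header=False):
    """Append one row of epoch metrics to a CSV file."""
    row = {"epoch": epoch}
    row.update({"train_" + key: value for key, value in train_metrics.items()})
    row.update({"eval_" + key: value for key, value in eval_metrics.items()})
    with open(filename, mode="a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row.keys()))
        if write_header:  # only needed on the first call
            writer.writeheader()
        writer.writerow(row)


with tempfile.TemporaryDirectory() as out_dir:
    path = os.path.join(out_dir, "summary.csv")
    append_summary(0, {"loss": 2.31}, {"loss": 2.25, "top1": 41.5}, path, write_header=True)
    append_summary(1, {"loss": 1.87}, {"loss": 1.75, "top1": 52.0}, path)
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
print(rows)
```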

6.6.3. log.py

setup_default_logging(default_level=logging.INFO, log_path=‘’)

6.6.4. metrics.py

Defines class AverageMeter and the function accuracy

accuracy(output, target, topk=(1,))

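Parameters:
output is the matrix of predicted probabilities, of size batch_size * num(label),
target is the matrix of true labels, of size 1 * batch_size
topk is usually set to (1, n), meaning that at most the n largest values of the prediction matrix are taken. n > num(label) is not a problem, because maxk = min(max(topk), output.size()[1]) handles it: inside the function,
_, pred = output.topk(maxk, 1, True, True) actually takes at most the maxk largest values of the prediction matrix, where "_" holds the predicted probabilities and pred holds the indices, i.e. the predicted labels
Output:
The proportion of samples in this batch whose highest-probability index is the correct label, and the proportion of samples in this batch whose maxk highest-probability indices include the correct label.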
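
The same computation can be written in pure Python to see what the two returned numbers mean. `topk_accuracy` is a hypothetical helper that uses nested lists instead of tensors and returns percentages, as an illustration of the idea.

```python
def topk_accuracy(output, target, topk=(1,)):
    """Percentage of samples whose true label is among the k highest scores."""
    maxk = min(max(topk), len(output[0]))  # never ask for more classes than exist
    results = []
    for k in topk:
        k = min(k, maxk)
        correct = 0
        for scores, label in zip(output, target):
            ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
            correct += label in ranked[:k]
        results.append(100.0 * correct / len(target))
    return results


output = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
target = [1, 1, 0]
print(topk_accuracy(output, target, topk=(1, 2)))
```

In this toy batch only the first sample is right at top-1, and the second sample becomes correct once the two highest scores are considered.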

6.6.5. misc.py

natural_key(string_) provides the key used with sorted to order files
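
A natural sort key splits a name into text and number chunks so that numbers compare numerically. The sketch below shows one common way to write such a key and the difference from plain string sorting; the file names are made up for the example.

```python
import re


def natural_key(string_):
    """Split into text and integer chunks so that 'img10' sorts after 'img2'."""
    return [int(s) if s.isdigit() else s for s in re.split(r"(\d+)", string_.lower())]


files = ["img10.jpg", "img2.jpg", "img1.jpg"]
print(sorted(files))
print(sorted(files, key=natural_key))
```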

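7. avg_checkpoints.py

This script averages the model weights of all checkpoints on the specified path that match a filter wildcard. To get good results, the checkpoints must come from the same training run.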
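
Checkpoint averaging boils down to a key-by-key mean over several state dicts. The sketch below uses lists of floats in place of tensors and skips the file matching and loading; `average_state_dicts` is a hypothetical helper.

```python
def average_state_dicts(state_dicts):
    """Average several checkpoints key by key (values are lists of floats here)."""
    averaged = {}
    for key in state_dicts[0].keys():
        columns = zip(*(sd[key] for sd in state_dicts))
        averaged[key] = [sum(col) / len(state_dicts) for col in columns]
    return averaged


checkpoints = [
    {"fc.weight": [1.0, 2.0], "fc.bias": [0.0]},
    {"fc.weight": [3.0, 4.0], "fc.bias": [1.0]},
]
print(average_state_dicts(checkpoints))
```

The averaging only makes sense when every checkpoint has the same keys and shapes, which is why the checkpoints should come from the same training run.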

8 benchmark.py

A benchmark script for the inference and training steps of timm models.

9 inference.py

An example inference script that writes the top-k class ids of the images in a folder to a CSV file.

10 train.py

This is a lean, easily modifiable ImageNet training script that reproduces ImageNet training results with some of the latest networks and training techniques. It favours canonical PyTorch and standard Python style over trying to do everything; that said, it offers quite a few speed and training-result improvements over the usual PyTorch example scripts, and can be repurposed freely.

Parameters

These arguments are to define Dataset/Model parameters, Optimizer parameters, Learning Rate scheduler parameters, Augmentation and regularization, Batch Norm parameters, Model exponential moving average parameters, and some miscellaneous parameters such as --seed, --tta etc.
Do note that some random augmentations are set by default such as color_jitter, hflip but there is a parameter no-aug in case you wanted to turn off all training data augmentations. Also, the default optimizer opt is 'sgd' but it is possible to change that. timm offers a vast number of optimizers to train your models with.

Argument	Description
--aa	Auto-Augment

Running

Distributed Training on multiple GPUs
To train models on multiple GPUs, simply replace python train.py with ./distributed_train.sh like so:

./distributed_train.sh 4 ./imagenette2-320 --aug-splits 3 --jsd

This trains the model using AugMix data augmentation on 4 GPUs.

step and args

loss function
args.jsd_loss, args.aug_splits, args.smoothing
args.mixup, args.cutmix, args.cutmix_minmax, args.bce_loss, args.bce_target_thresh
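Distributed computing
args.local_rank: gpu_id, an int variable; only one GPU can be specified, default 0
It has another role: when args.local_rank == 0, the log is written and sometimes the train batch images are saved
train
Whether to use wandb: wandb must first be installed so that has_wandb = True; args.log_wandb is also involved and defaults to False
augmentation, mixup, cutmix
Attention
When mixed precision is used, train_one_epoch and validate raise an error:

Error: Input type (torch.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same

This happens because the input was not moved to the GPU. If args.prefetcher = False, change

if not args.prefetcher: 
to
if args.prefetcher: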

11 validate.py

A lean, easily modifiable ImageNet validation script; its structure is similar to train.py.

11.1. Parameters

name	annotation	default or example	relation
valid-labels	Path to a .txt file holding some or all of the label indices to validate, one per line; the content is assigned to the variable valid_labels, then output = model(input) and output = output[:, valid_labels]	default=''
real-labels	Path; real labels JSON file for ImageNet evaluation	default=''
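
How a valid-labels file restricts the model output can be sketched with plain lists: read one class index per line, then keep only those columns of every output row. Both helpers are hypothetical stand-ins for the tensor indexing output[:, valid_labels], and the file name is made up for the example.

```python
import os
import tempfile


def load_valid_labels(path):
    """Read one class index per line and return the indices sorted."""
    with open(path) as f:
        return sorted({int(line.strip()) for line in f if line.strip()})


def select_columns(output, valid_labels):
    """Keep only the scores of the listed classes for every sample."""
    return [[row[i] for i in valid_labels] for row in output]


with tempfile.TemporaryDirectory() as tmp:
    label_file = os.path.join(tmp, "valid_labels.txt")
    with open(label_file, "w") as f:
        f.write("0\n2\n")
    valid_labels = load_valid_labels(label_file)

output = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
print(select_columns(output, valid_labels))
```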