CUDA-MODE Course Notes, Lecture 6: How to Optimize Optimizers in PyTorch

My course notes; you are welcome to follow them at: https://github.com/BBuf/how-to-optim-algorithm-in-cuda/tree/master/cuda-mode


Course content

The three slides above illustrate the trade-off between runtime and memory usage.

Slide 1:

  • States that runtime and memory usage are usually at odds with each other.
  • Shows two delivery vehicles: a small truck (low memory usage but slow) and a large truck (high memory usage but fast).
  • Poses a question: if you need to transport 512 cars, which truck should you choose?

Slide 2:

  • Adds a new constraint on top of the first slide: there is a low bridge along the route.
  • This represents situations where we cannot simply pick the high-memory option (the large truck), because of hardware or system limits.

Slide 3:

  • States explicitly: "Today we focus on speed!"
  • Shows the small truck crossed out, meaning the large truck (higher memory usage but faster) is the chosen option.
  • Also adds the disclaimer that "this does mean memory will take a hit".

This slide shows a naive optimizer implementation. The key point: with M parameters and N operations per parameter, looping over all parameters takes M * N operations in total, and since each elementwise operation is its own CUDA kernel, that is roughly M * N kernel launches per optimizer step.
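To make the M * N counting concrete, here is a minimal sketch (my own, not taken from the slides) of a naive single-tensor Adam step; the tensor lists `params`, `grads`, `exp_avgs`, `exp_avg_sqs` are assumed inputs of matching length.

```python
import torch

@torch.no_grad()
def naive_adam_step(params, grads, exp_avgs, exp_avg_sqs, step,
                    lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # A Python-level loop over the M parameters; every op in the body is a
    # separate small CUDA kernel, so one step costs roughly M * N launches.
    for p, g, m, v in zip(params, grads, exp_avgs, exp_avg_sqs):
        m.mul_(beta1).add_(g, alpha=1 - beta1)            # first moment
        v.mul_(beta2).addcmul_(g, g, value=1 - beta2)     # second moment
        m_hat = m / (1 - beta1 ** step)                   # bias correction
        v_hat = v / (1 - beta2 ** step)
        p.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)  # parameter update
```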

This slide introduces an optimization called the "horizontally fused optimizer", which fuses away the for-loop over parameters in the naive implementation: each operation is applied to all M parameters at once instead of one parameter at a time.
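Below is a sketch of the same Adam step written with PyTorch's `torch._foreach_*` multi-tensor ops, which is the style the horizontally fused (`foreach`) path in `torch.optim` is built on; the tensor-list names follow the naive sketch above. Each `_foreach_*` call processes all M parameters, so the per-step launch count no longer scales with M.

```python
import torch

@torch.no_grad()
def foreach_adam_step(params, grads, exp_avgs, exp_avg_sqs, step,
                      lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Each _foreach_* call applies one op across the whole list of parameters.
    torch._foreach_mul_(exp_avgs, beta1)
    torch._foreach_add_(exp_avgs, grads, alpha=1 - beta1)                # first moments
    torch._foreach_mul_(exp_avg_sqs, beta2)
    torch._foreach_addcmul_(exp_avg_sqs, grads, grads, value=1 - beta2)  # second moments
    m_hat = torch._foreach_div(exp_avgs, 1 - beta1 ** step)             # bias correction
    v_hat = torch._foreach_div(exp_avg_sqs, 1 - beta2 ** step)
    denom = torch._foreach_sqrt(v_hat)
    torch._foreach_add_(denom, eps)
    torch._foreach_addcdiv_(params, m_hat, denom, value=-lr)             # parameter update
```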

This slide points out that we can go further and fuse the entire optimizer update into a single CUDA kernel.
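For the stock optimizers, both paths are already exposed as constructor flags: `torch.optim.Adam` accepts `foreach=True` (the horizontally fused multi-tensor path) and `fused=True` (a fully fused CUDA kernel per step, CUDA parameters only). A minimal usage sketch:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()

# Horizontally fused: torch._foreach_* based multi-tensor path.
opt_foreach = torch.optim.Adam(model.parameters(), lr=1e-4, foreach=True)

# Fully fused: one fused CUDA kernel per optimizer step (CUDA params only).
opt_fused = torch.optim.Adam(model.parameters(), lr=1e-4, fused=True)
```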

The core message of this slide: in CUDA programming, reducing the number of kernel launches improves execution efficiency. Every CUDA kernel launch carries a fixed overhead, so merging multiple operations into fewer kernels amortizes that overhead and improves overall performance. Horizontal fusion and vertical fusion are the two main strategies for achieving this: horizontal fusion merges similar parallel operations (the same op across many parameters), while vertical fusion further merges the different computation steps of the update rule.
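As a rough illustration (an assumed setup, not a benchmark from the lecture), the sketch below times the single-tensor, foreach, and fused Adam variants on many small parameter tensors, where launch overhead dominates; the actual numbers will vary by GPU and PyTorch version.

```python
import torch

def time_adam(n_steps=100, **adam_kwargs):
    # Many small parameter tensors make kernel-launch overhead dominate.
    params = [torch.randn(1024, device="cuda", requires_grad=True) for _ in range(200)]
    for p in params:
        p.grad = torch.randn_like(p)
    opt = torch.optim.Adam(params, lr=1e-4, **adam_kwargs)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(n_steps):
        opt.step()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / n_steps  # milliseconds per step

print("single-tensor:", time_adam(foreach=False), "ms/step")
print("foreach      :", time_adam(foreach=True), "ms/step")
print("fused        :", time_adam(fused=True), "ms/step")
```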
