It's been a while since I built a model from scratch, and there are few examples of applying hyperopt to torch, so I'm recording some of the problems I ran into while debugging.
——————
1. hyperopt fails to generate the search space properly
Reading the library's source code: the function fn passed into fmin must take the point sampled from the search space as its first, and preferably only, argument. From fmin's docstring:
fn : callable (trial point -> loss)
This function will be called with a value generated from `space`
as the first and possibly only argument. It can return either
a scalar-valued loss, or a dictionary. A returned dictionary must
contain a 'status' key with a value from `STATUS_STRINGS`, must
contain a 'loss' key if the status is `STATUS_OK`. Particular
optimization algorithms may look for other keys as well. An
optional sub-dictionary associated with an 'attachments' key will
be removed by fmin its contents will be available via
`trials.trial_attachments`. The rest (usually all) of the returned
dictionary will be stored and available later as some 'result'
sub-dictionary within `trials.trials`.
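A minimal sketch of an objective in that shape (the quadratic is just a stand-in for real model training; only the calling contract matters here):

from hyperopt import fmin, tpe, hp, STATUS_OK, Trials

def objective(params):
    # params is the point sampled from `space`, the first (and here only)
    # argument that fmin passes to fn
    val_loss = (params['learning_rate'] - 0.01) ** 2  # stand-in for training the torch model
    return {'loss': val_loss, 'status': STATUS_OK}

space = {'learning_rate': hp.loguniform('learning_rate', -7, 0)}
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=Trials())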
2. Input data dimensions
The model's input_dim and output_dim have to match the dataset: input_dim is the number of feature columns, and output_dim the number of target columns.
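A quick sketch with made-up shapes:

import torch
import torch.nn as nn

X = torch.randn(500, 13)   # 500 samples, 13 features
y = torch.randn(500, 1)    # one regression target per sample

input_dim = X.shape[1]     # 13, the feature count
output_dim = y.shape[1]    # 1, the target count
model = nn.Linear(input_dim, output_dim)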
3. Loss function outputs NaN
The dataset contained null values, which made the loss evaluate to NaN during backpropagation, so the missing values have to be handled.
Example fix: df = df.dropna()
Also: the problem can be tracked down with: with torch.autograd.detect_anomaly():
One more note: the run now completes, but the loss still looks a bit high; normalization is probably worth considering later.
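A small sketch of both checks on toy data (the column names are made up; detect_anomaly wraps the forward and backward pass and raises at the op that first produces NaN):

import pandas as pd
import torch

df = pd.DataFrame({'x': [1.0, None, 3.0], 'y': [2.0, 4.0, 6.0]})
df = df.dropna()  # drop the rows containing missing values

x = torch.tensor(df['x'].values, dtype=torch.float32, requires_grad=True)
y = torch.tensor(df['y'].values, dtype=torch.float32)
with torch.autograd.detect_anomaly():
    loss = ((x * 2 - y) ** 2).mean()  # stand-in for the real forward pass
    loss.backward()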
4. outputs and labels shapes don't match
Error message, something like: UserWarning: Using a target size (torch.Size([64,1])) that is different to the input size (torch.Size([64,])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
Reference: https://blog.csdn.net/xll_bit/article/details/123906121
Fix: before the forward function returns its output, apply torch.squeeze() to outputs ([64,1]) to align it with the shape of labels ([64]); otherwise torch broadcasts automatically and the loss computation goes wrong. (To add a dimension instead, use torch.unsqueeze.)
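A minimal sketch (squeeze(-1) drops only the trailing size-1 dimension, which is safer than a bare squeeze() when the batch size might be 1):

import torch
import torch.nn as nn

outputs = torch.randn(64, 1)  # what the last Linear layer produces
labels = torch.randn(64)      # what the DataLoader yields

criterion = nn.MSELoss()
loss = criterion(outputs.squeeze(-1), labels)  # [64] vs [64], no broadcasting

Inside a model, the same fix goes at the end of forward, i.e. return out.squeeze(-1).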
5. hp.choice
hp.choice results come back from fmin as indices into the options list;
but the examples all use params['label'] directly
to feed the parameter into the model, and in practice applying them in the model produced values that don't exist in the options list (0, 1, etc.).
Error output:
{'activation': 0, 'batch_size': 0, 'layers': 1, 'learning_rate': 2, 'regularization_rate': 2, 'units1': 2, 'units2': 0, 'units3': 1, 'units4': 7}
/.../torch/lib/python3.11/site-packages/torch/nn/init.py:412: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
The search space:
from hyperopt import hp

space = {'units1': hp.choice('units1', [4, 8, 16, 32, 64, 128, 256, 512]),
         'units2': hp.choice('units2', [4, 8, 16, 32, 64, 128, 256, 512]),
         'units3': hp.choice('units3', [4, 8, 16, 32, 64, 128, 256, 512]),
         'units4': hp.choice('units4', [4, 8, 16, 32, 64, 128, 256, 512]),
         'layers': hp.choice('layers', [2, 3, 4]),
         'batch_size': hp.choice('batch_size', [32, 64, 128, 256, 500]),
         'learning_rate': hp.choice('learning_rate', [0.001, 0.01, 0.1]),
         'regularization_rate': hp.choice('regularization_rate', [0, 0.001, 0.01, 0.1]),  # weight_decay on the optimizer, for regularization
         'activation': hp.choice('activation', ['relu', 'softplus'])}
Reference: https://docs.azure.cn/zh-cn/databricks/machine-learning/automl-hyperparam-tuning/hyperopt-best-practices
When the search finishes, fmin returns the best parameters in index form; setting return_argmin=False in the fmin call makes it return the actual values.
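A sketch of both routes, reusing the space above and assuming an objective like the one sketched in item 1:

from hyperopt import fmin, tpe, space_eval

# option 1: have fmin return the actual option values directly
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=100, return_argmin=False)

# option 2: keep the default index form and map it back through the space
best_idx = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100)
best = space_eval(space, best_idx)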
6. early_stop_fn
The search stopped before finishing the configured number of evaluations; after some digging, it turned out the reference code had set early_stop_fn.
Reference: https://zhuanlan.zhihu.com/p/629690012
from hyperopt.early_stop import no_progress_loss
early_stop_fn is fmin's early-stopping parameter; it usually takes no_progress_loss() imported from hyperopt, which accepts a number n meaning: stop the search early once the loss has failed to improve for n consecutive trials. Since Bayesian optimization is fairly stochastic, it can take many iterations to find the optimum when the number of trials is small, so a small n can cut the search off too early.
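For example (again reusing objective and space from above):

from hyperopt import fmin, tpe
from hyperopt.early_stop import no_progress_loss

# stop only after 20 consecutive trials without improvement in the loss
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=500, early_stop_fn=no_progress_loss(20))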