[Ray.Tune] Usage Notes (Work in Progress)

Article link: https://blog.csdn.net/m0_38052500/article/details/121930929

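First, the keyword arguments passed to report are chosen by you, and the value bound to each one must be computed somewhere in your program. That much is self-evident.
Every metric named in report is also shown as a column in the status table that Ray prints while it runs.
For example: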

tune.report(loss=mean_loss, accuracy=test_accuracy, accuracy2=test_accuracy)
# =======================
+---------------------+------------+---------------------+------+--------+------------------+---------+------------+-------------+
| Trial name          | status     | loc                 |   lr |   iter |   total time (s) |    loss |   accuracy |   accuracy2 |
|---------------------+------------+---------------------+------+--------+------------------+---------+------------+-------------|
| DEFAULT_8aa09_00000 | TERMINATED | 172.27.67.94:290338 | 0.01 |      5 |         137.438  | 1.39491 |    53.7037 |     53.7037 |
| DEFAULT_8aa09_00001 | TERMINATED | 172.27.67.94:290340 | 0.1  |      1 |          29.1316 | 1.50628 |    48.8889 |     48.8889 |
+---------------------+------------+---------------------+------+--------+------------------+---------+------------+-------------+

Second, the metrics named in report can naturally be used as argument values in the related APIs, for example analysis.get_best_config(metric="accuracy", mode="max"):

logger.info("Best config: {}".format(analysis.get_best_config(metric="accuracy", mode="max")))
logger.info("Best config: {}".format(analysis.get_best_config(metric="accuracy2", mode="max")))
logger.info("Best config: {}".format(analysis.get_best_config(metric="loss", mode="min")))
# =================================
Best config: {'lr': 0.01}
Best config: {'lr': 0.01}
Best config: {'lr': 0.01}
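To make the metric/mode selection concrete, here is a minimal sketch in plain Python. It does not call Ray; pick_best_config is a hypothetical helper that only mimics the idea of choosing a config by the last reported value, and the numbers are copied from the status table above.

```python
# Hypothetical helper for illustration; this is not part of Ray Tune.
def pick_best_config(trials, metric, mode):
    """Return the config of the trial whose last result is best for `metric`."""
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    chooser = max if mode == "max" else min
    best = chooser(trials, key=lambda trial: trial["last_result"][metric])
    return best["config"]


# Last reported values, copied from the status table above.
trials = [
    {"config": {"lr": 0.01},
     "last_result": {"loss": 1.39491, "accuracy": 53.7037, "accuracy2": 53.7037}},
    {"config": {"lr": 0.1},
     "last_result": {"loss": 1.50628, "accuracy": 48.8889, "accuracy2": 48.8889}},
]

print(pick_best_config(trials, metric="accuracy", mode="max"))  # {'lr': 0.01}
print(pick_best_config(trials, metric="loss", mode="min"))      # {'lr': 0.01}
```

Any key passed to report can serve as the metric here, which is why accuracy2 works just as well as accuracy.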

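By default, the result dictionary produced during a ray.tune run contains the following keys:
[Image: list of the default result keys]
The result above was obtained with the learning rate as the only hyperparameter, restricted to the two candidate values 0.1 and 0.01.
It was produced by calling analysis.dataframe() and saving the returned DataFrame to a CSV file with to_csv.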

Besides this, there is another way to control the displayed columns globally:

from ray.tune import CLIReporter
reporter = CLIReporter(metric_columns=["loss", "mean_accuracy2", "training_iteration"])

# =======================
+---------------------+------------+---------------------+------+---------+----------------------+
| Trial name          | status     | loc                 |   lr |    loss |   training_iteration |
|---------------------+------------+---------------------+------+---------+----------------------|
| DEFAULT_2d6b3_00000 | RUNNING    | 172.27.67.94:295017 | 0.01 | 1.54909 |                    1 |
| DEFAULT_2d6b3_00001 | TERMINATED | 172.27.67.94:295016 | 0.1  | 2.01156 |                    1 |
+---------------------+------------+---------------------+------+---------+----------------------+

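The metrics listed in CLIReporter replace the set chosen through the report function, but not unconditionally: a metric listed in CLIReporter appears in the status table only once report has assigned it a value.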
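The rule can be pictured as a simple intersection. The sketch below is plain Python and does not use Ray; visible_metric_columns is a hypothetical helper, and the accuracy values in the reported dict are placeholders.

```python
# Hypothetical helper for illustration; this is not how CLIReporter is implemented.
def visible_metric_columns(metric_columns, reported_result):
    """Keep only the requested columns that the trial has actually reported."""
    return [name for name in metric_columns if name in reported_result]


# Columns requested in CLIReporter(metric_columns=[...]) in the snippet above.
requested = ["loss", "mean_accuracy2", "training_iteration"]

# Keys available in a trial result; the accuracy values are placeholders.
reported = {"loss": 1.54909, "accuracy": 50.0, "accuracy2": 50.0,
            "training_iteration": 1}

print(visible_metric_columns(requested, reported))  # ['loss', 'training_iteration']
```

mean_accuracy2 was never reported, so it is missing from the table, while accuracy and accuracy2 were reported but not requested, so they are hidden as well.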

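Commonly used Tune API reference pages

1 Search space: functions for defining the hyperparameter search space.

Note: check that the search algorithm you intend to use is compatible with the functions used in the search space, because sample_from and grid_search are often not supported.

2 Search algorithms: wrappers around open-source optimization libraries, each with its own way of defining the search space, for efficient hyperparameter selection.

3 Schedulers: stop poorly performing trials early, pause trials, clone trials, and change the hyperparameters of running trials in order to optimize the hyperparameters.

4 Tutorials and FAQ: guidance on choosing search algorithms and schedulers, and on reproducing trials.

5 Hyperparameter tuning with Ray Tune: the introduction to Ray Tune on the PyTorch website, which includes a complete small example.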

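+ A classic framework

1 Phoenix head (the opening)

That is, configuring the basics.
[Image: the six parts of the basic configuration]
It consists of the six parts shown above. Beyond those, one design point deserves attention: parameters generally fall into two groups, those to be tuned (config) and those left fixed (arg = parser.parse_args()). To keep them separate, bind arg to spot, the main function passed to tune.run, with functools.partial.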
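The binding step can be tried without Ray. In this minimal sketch the body of spot and the epochs field are placeholders; only the partial call mirrors what the full script does.

```python
from argparse import Namespace
from functools import partial


def spot(arg, config, checkpoint_dir=None):
    """Toy trainable: `arg` holds the fixed settings, `config` the tuned ones."""
    return {"epochs": arg.epochs, "lr": config["lr"]}


# Stands in for parser.parse_args(); the `epochs` field is a placeholder.
fixed_args = Namespace(epochs=65)

# After binding, the callable only needs `config`, which is what Tune supplies.
trainable = partial(spot, fixed_args)

print(trainable({"lr": 0.01}))  # {'epochs': 65, 'lr': 0.01}
```

Because arg is bound once up front, the tuned values in config never need to be merged into the command-line parser.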

Defining the search space

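import numpy as np
from ray import tune

tune.run(
    trainable,
    config={
        "param1": tune.choice([True, False]),  # pick from a finite list of values
        "bar": tune.uniform(0, 10),  # sample from a uniform distribution
        "alpha": tune.sample_from(lambda _: np.random.uniform(100) ** 2),  # custom sampling function
        "const": "hello",  # a constant
        "param2": tune.grid_search([True, False]),  # try every listed value; renamed from the duplicate key "bar"
    })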

The complete code is as follows:


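from functools import partial

import torch
import yaml
from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler

# Processor, get_parser, init_seed, logger, logdir and model_save_dir come from the original project.


def spot(arg, config, checkpoint_dir=None):
    import copy
    arg = copy.deepcopy(arg)

    # To leave the original code files as untouched as possible, the parameters from the
    # yaml file that need tuning are overridden inside this function.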
    arg.base_lr = config["lr"]
    arg.batch_size = config["batch_size"]
    arg.weight_decay = config["weight_decay"]
    arg.step = config["step"]
    arg.optim_args["momentum"] = config["momentum"]
    arg.warm_up_epoch = config["warm_up_epoch"]

    processor = Processor(arg, logdir, model_save_dir)
    processor.start()


if __name__ == '__main__':
    dataType = "skeletons"  # one of ["images", "skeletons"]
    yamlConfigPath = './config/dad/clip225/none/agcn/train_bone.yaml'
    logger.info("Config file: {}".format(yamlConfigPath))
    parser = get_parser(yamlConfigPath)
    p = parser.parse_args()  # parameters predefined in this .py file

    if p.config is not None:
        with open(p.config, 'r') as f:
            default_arg = yaml.full_load(f)  # parameters from the yaml file
        key = vars(p).keys()
        for k in default_arg.keys():
            if k not in key:
                print('WRONG ARG: {}'.format(k))
                assert (k in key)
        parser.set_defaults(**default_arg)
    arg = parser.parse_args()

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    init_seed(arg.seed)

    # All parameters must be finalized before this point.
    with open('{}/config.yaml'.format(logdir), 'w') as f:
        yaml.dump(vars(arg), f)

    config = {
        "lr": tune.choice([0.1, 0.01, 0.001, 0.05, 0.005]),
        "batch_size": tune.grid_search([16, 32, 64, 128]),
        "weight_decay": tune.grid_search([0.0001, 0.0003, 0.0005]),
        "step": tune.grid_search([[45, 55, 65],
                                  [30, 50],
                                  [20, 40, 60]]),
        "momentum": tune.grid_search([0.7, 0.8, 0.9]),
        "warm_up_epoch": tune.grid_search([0, 5, 10])
    }

    scheduler = ASHAScheduler(
        metric="accuracy",
        mode="max",
        max_t=65,
        grace_period=1,
        reduction_factor=2)

    reporter = CLIReporter(
        metric_columns=["loss", "accuracy", "training_iteration"])

    analysis = tune.run(
        partial(spot, arg),
        num_samples=200,
        resources_per_trial={"cpu": 16, "gpu": 1},
        config=config,
        scheduler=scheduler,
        progress_reporter=reporter
    )
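The ASHAScheduler in the script above is configured with grace_period=1, reduction_factor=2 and max_t=65. As a rough sketch of what those numbers imply, assuming a simplified successive-halving rule, the iterations at which trials are compared can be listed as follows. asha_rungs is a hypothetical helper and is not Ray's implementation, whose details may differ.

```python
# Hypothetical helper; a simplified view of successive-halving rung levels.
def asha_rungs(grace_period, reduction_factor, max_t):
    """List the iterations at which trials are compared and possibly stopped."""
    if grace_period < 1 or reduction_factor <= 1:
        raise ValueError("need grace_period >= 1 and reduction_factor > 1")
    rungs = []
    t = grace_period
    while t <= max_t:
        rungs.append(t)
        t *= reduction_factor  # each rung is reduction_factor times later
    return rungs


# Settings taken from the ASHAScheduler call in the script above.
print(asha_rungs(grace_period=1, reduction_factor=2, max_t=65))
# [1, 2, 4, 8, 16, 32, 64]
```

Under this simplified rule roughly the top 1/reduction_factor of trials survive each rung, so with grace_period=1 most configurations are stopped after very few iterations.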

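2 Pig belly (the body)

The core is still the Processor class, whose start function launches training.
The original project code is left essentially unmodified; the spot function is the only link between Ray Tune and the processor.
This keeps the coupling between the two low.

3 Dragon tail (the ending)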

# Get the results as a dataframe
import pandas as pd
pd.set_option('display.max_columns', None)

# 1 Return a pandas.DataFrame built from the information of all trials.
df = analysis.dataframe()
df.to_csv("df_testcsv.csv")
logger.info("df:\n{}".format(df))

# 2 Return a dict that maps each trial to its own dataframe.
dfs = analysis.trial_dataframes
logger.info("dfs:\n{}".format(dfs))

# 3 Plot the results of each trial.
import matplotlib.pyplot as plt
ax = None  # This plots everything on the same plot
for d in dfs.values():
    ax = d.loss.plot(ax=ax, legend=True)

plt.xlabel("epoch")
plt.ylabel("Loss")
plt.show()

# 4 Show the best configuration.
logger.info("Best config: {}".format(analysis.get_best_config(metric="accuracy", mode="max")))
logger.info("Best config: {}".format(analysis.get_best_config(metric="loss", mode="min")))

best_trial = analysis.get_best_trial("loss", "min", "last")
logger.info("Best trial config1: {}".format(best_trial.config))
logger.info("Best trial final validation loss: {}".format(best_trial.last_result["loss"]))
logger.info("Best trial final validation accuracy: {}".format(best_trial.last_result["accuracy"]))

best_trial = analysis.get_best_trial("accuracy", "max", "last")
logger.info("Best trial config2: {}".format(best_trial.config))
logger.info("Best trial final validation loss: {}".format(best_trial.last_result["loss"]))
logger.info("Best trial final validation accuracy: {}".format(best_trial.last_result["accuracy"]))

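# trainable and algo are assumed to be defined elsewhere.
# The best_* properties below rely on the metric and mode passed to tune.run();
# without them, use get_best_trial(metric, mode, scope) as shown above.
analysis = tune.run(trainable, search_alg=algo, stop={"training_iteration": 20})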

best_trial = analysis.best_trial  # Get best trial
best_config = analysis.best_config  # Get best trial's hyperparameters
best_logdir = analysis.best_logdir  # Get best trial's logdir
best_checkpoint = analysis.best_checkpoint  # Get best trial's best checkpoint
best_result = analysis.best_result  # Get best trial's last results
best_result_df = analysis.best_result_df  # Get best result as pandas dataframe

# 5 Retrain with the best configuration found.
import copy
arg = copy.deepcopy(arg)
arg.base_lr = best_trial.config["lr"]
arg.batch_size = best_trial.config["batch_size"]
arg.weight_decay = best_trial.config["weight_decay"]
arg.step = best_trial.config["step"]
arg.optim_args["momentum"] = best_trial.config["momentum"]
arg.warm_up_epoch = best_trial.config["warm_up_epoch"]
processor = Processor(arg, logdir, model_save_dir)
processor.start()

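4 Other notes

1 The file-writing feature of the logging library loguru cannot be used

The logging library loguru cannot use its write-to-file feature here, but plain terminal log output is unaffected.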

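Inspecting the contents of a DataFrame

1 Displaying the full DataFrame in the console

pd.set_option('display.max_columns', a)  # a is the maximum number of columns to display
pd.set_option('display.max_rows', b)  # b is the maximum number of rows to display
pd.set_option('display.width', x)  # x is the display width, which prevents premature line wrapping

To print everything by default, set the second argument of all three calls to None.

2 Saving to a local file

df.to_csv("df_testcsv.csv"), with all other options left at their defaults.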

+ Translation

1 Tutorial: Accelerated Hyperparameter Tuning For PyTorch

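In this tutorial, we show how to use Tune for state-of-the-art hyperparameter tuning.
Specifically, we use ASHA and Bayesian optimization (via HyperOpt) without having to modify the existing code.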

Code: https://github.com/ray-project/ray/tree/master/python/ray/tune
Examples: https://github.com/ray-project/ray/tree/master/python/ray/tune/examples
Documentation: http://ray.readthedocs.io/en/latest/tune.html
Mailing List: https://groups.google.com/forum/#!forum/ray-dev

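First, we import some helper functions, such as the train function that is called step by step inside the loop.
To make decisions during training, the training function has to notify Tune; the tune.track API lets the Tune library keep track of the current results. To do this, add tune.track.log(mean_accuracy=acc) to the training loop.

However, tune.report(mean_accuracy=acc) is now the more common form.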

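Example of running an experiment
We first run one experiment that randomly samples the learning rate and momentum from uniform distributions. A trial here means one training run with one set of parameters, while an experiment is a collection of trials.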
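The trial/experiment distinction can be sketched in plain Python without Ray. Everything here is an assumption made for illustration: the sampling ranges, the toy_objective scoring function (a made-up formula, not a real accuracy) and the helper names.

```python
import random


def toy_objective(config):
    """Stand-in for a training run; returns a made-up score, not a real accuracy."""
    return 1.0 - abs(config["lr"] - 0.01) - abs(config["momentum"] - 0.9) / 10


def run_experiment(num_trials, seed=0):
    """An experiment is a collection of trials, one per sampled configuration."""
    rng = random.Random(seed)
    trials = []
    for _ in range(num_trials):
        # One trial = one configuration drawn from uniform distributions.
        config = {"lr": rng.uniform(0.001, 0.1),
                  "momentum": rng.uniform(0.1, 0.9)}
        trials.append({"config": config, "score": toy_objective(config)})
    return trials


experiment = run_experiment(num_trials=5)
best = max(experiment, key=lambda trial: trial["score"])
print(len(experiment), best["config"])
```

In Tune the same roles are played by tune.uniform in the config, num_samples for the number of trials, and the reported metric for the score.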

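tune.loguniform(0.0001, 0.1) samples uniformly on a logarithmic scale, so every order of magnitude between 0.0001 and 0.1 is equally likely.
ray.shutdown() is called before ray.init() to restart Ray defensively, in case the connection to Ray has been lost.
ray.init(log_to_driver=False): if log_to_driver is true, the output from all of the worker processes on all nodes is directed to the driver. ray.init starts Ray and all related processes locally. Most of its parameters are not needed in ordinary use.
- ray.init(num_cpus=8, num_gpus=1) specifies the available resources explicitly.
- _temp_dir sets the root temporary directory for the Ray processes. The default on Linux is /tmp/ray.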
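As a side note on the tune.loguniform(0.0001, 0.1) call described above, log-uniform sampling can be sketched with the standard library alone. The loguniform function below is a hypothetical stand-in written for this note, not Ray's implementation.

```python
import math
import random


def loguniform(low, high, rng=random):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [loguniform(1e-4, 1e-1, rng) for _ in range(30000)]

# Count the samples per decade: [1e-4, 1e-3), [1e-3, 1e-2), [1e-2, 1e-1].
decades = [0, 0, 0]
for value in samples:
    index = max(0, min(2, int(math.log10(value) + 4)))
    decades[index] += 1

print(decades)  # the three counts come out roughly equal
```

A plain uniform distribution over the same range would put about 90% of the samples in the top decade, which is why log-uniform sampling is the usual choice for learning rates.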
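tune.run runs the training.
- When a SIGINT signal is received, for example via Ctrl+C, Tune shuts down gracefully and checkpoints the latest experiment state. Sending SIGINT again skips this step. (The parameter notes below cover only what was new to me; basic explanations are omitted.)
- run_or_experiment: the algorithm or model to train, which can be a user-defined trainable function or class.
- metric: the metric to optimize. It must be one that is reported through tune.report() (pay particular attention to this). If set, it is passed on to the search algorithm and the scheduler.
- mode: must be one of [min, max]. Determines whether the objective is to minimize or maximize the metric.
- name: the experiment name. One addition: if it is an absolute path, for example name="/root/home/train_mnist", the log directory is exactly that path: Result logdir: /root/home/train_mnist. If it is not an absolute path, for example name="home/train_mnist", the logs are stored under: Result logdir: /root/ray_results/home/train_mnist. By default the path is /root/ray_results/ followed by the name of run_or_experiment.
- stop: the stopping criteria. If it is a dict, the keys can be any metric defined in the report call (the original wording is: the keys may be any field in the return result of 'train()'), whichever is reached first.
- time_budget_s: a global time budget in seconds, after which all trials are stopped. This is well worth using when time is limited, because specifying num_samples alone gives no indication of how long the final result will take.
- resources_per_trial: the machine resources allocated to each trial, in the form {"cpu": x, "gpu": x}. No GPU is allocated unless it is specified here; the default is 1 CPU and 0 GPUs.
- num_samples: the number of times to sample from the hyperparameter space; the default is 1. If set to -1, samples are generated indefinitely until a stopping condition is met.
- local_dir: the local directory in which training results are saved; the default is ~/ray_results. This interacts with the name parameter described above: if local_dir keeps its default and name is set to an absolute path, the logs are written to the path given by name. The cleaner approach is to let local_dir supply the path prefix and keep name a plain string that contains no path.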
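- search_alg: an error message revealed the accepted values: ValueError: The search_alg argument must be one of ['variant_generator', 'random', 'ax', 'dragonfly', 'skopt', 'hyperopt', 'bayesopt', 'bohb', 'nevergrad', 'optuna', 'zoopt', 'sigopt', 'hebo', 'blendsearch', 'cfo']. Got: todo. See the search algorithm module tune.suggest for details; it is too long to cover here.
- scheduler: the scheduler that executes the experiment, chosen from five candidates: FIFO (the default), MedianStopping, AsyncHyperBand, HyperBand, and PopulationBasedTraining. Each has an associated paper that is worth studying in depth.
- verbose: the verbosity level, from 0 to 3. 0 is silent, 1 shows only status updates, 2 shows status and brief trial results, and 3 shows status and detailed trial results.
- progress_reporter: the reporter used to show experiment progress. The default is CLIReporter when running from the command line and JupyterNotebookReporter when running in Jupyter.
- log_to_file: log stdout and stderr to files in Tune's trial directories.
  - If False (the default), no files are written.
  - If True, output is written to trialdir/stdout and trialdir/stderr respectively.
  - If a single string, it is interpreted as a file relative to trialdir, and both streams are written to it.
  - If a sequence (for example a tuple), it must have length 2, and its elements name the files for stdout and stderr respectively.
    A note on Python's stdout: print sends text to the terminal by writing to sys.stdout, so saving stdout means that everything shown on the terminal, and possibly more, is written to the file.
- export_formats: a list of formats to export at the end of the experiment. The default is none. This is a useful extension point when visualization matters.
- max_failures: try to recover a trial at most this many times. Ray recovers from the latest checkpoint if one exists. -1 means infinite retries, and 0 disables retries. The default is 0.
- resume: one of "LOCAL", "REMOTE", "PROMPT", "ERRORED_ONLY", or a bool. With "LOCAL" or True, the checkpoint is restored from the local experiment directory, which is determined by the name and local_dir parameters; False forces a new experiment.
- max_concurrent_trials: the maximum number of trials to run concurrently. Must be non-negative. If None or 0, no limit is applied. I noticed this parameter because only three of my four GPUs were being used, which made me suspect that some parameter limits the number of trials; this is still to be verified.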
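The interaction between name and local_dir noted in the list above matches how ordinary path joining behaves: an absolute second component discards the prefix. The sketch below only demonstrates that standard-library behavior. That Tune combines the two values this way is an assumption inferred from the observed log directories, and resolve_logdir is a hypothetical helper.

```python
import posixpath


# Hypothetical helper; mirrors the observed logdir behavior, not Tune's code.
def resolve_logdir(name, local_dir="/root/ray_results"):
    """Join local_dir and name; an absolute name discards the local_dir prefix."""
    return posixpath.join(local_dir, name)


print(resolve_logdir("/root/home/train_mnist"))  # /root/home/train_mnist
print(resolve_logdir("home/train_mnist"))        # /root/ray_results/home/train_mnist
print(resolve_logdir("train_mnist", "/data"))    # /data/train_mnist
```

The last call shows the cleaner convention: local_dir carries the path prefix and name stays a plain string.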

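+ Tutorial

1 Tune: scalable hyperparameter tuning

Tune is a Python library for experiment execution and hyperparameter tuning at any scale. Its core features are:

Launch a multi-node distributed hyperparameter sweep with only a small amount of code. (Convenient)
Support for the currently popular machine learning frameworks. (General)
Automatic checkpoint management and logging to TensorBoard.
State-of-the-art algorithms such as Population Based Training (PBT), Bayesian optimization search, and HyperBand/ASHA. (Advanced)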

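2 Key concepts

[Image: overview of Tune's key concepts]

The Trainable API

To use Tune, the training function must be wrapped in the lightweight Trainable API. The API comes in two common forms, a function version and a class version; the class version must inherit from a class provided by tune.

Interestingly, tune.report cannot be used in the class version.
In the function version, besides tune.report, metrics can also be reported to Ray Tune with Python's yield statement, and a return-based variant also seems to work. For the sake of consistent code, only the report approach is considered here.
Checkpointing is set aside for now, and the class version is not covered either.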

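tune.run and Trials

Use tune.run to perform hyperparameter tuning. This function manages your experiment and provides many features, such as logging, checkpointing, and early stopping.
Specifically, run generates several hyperparameter configurations from its search space and wraps each one in a Trial object; each trial is also associated with a Trainable instance.
(Note that trial and experiment are distinct concepts in this library's documentation: an experiment is a collection of trials.)
The final number of generated configurations depends on two things: the value of num_samples, and how the candidate values are defined in config, for example random sampling versus grid search. See the search space reference (grid/random).
By default, each random variable and grid search point is sampled once. When num_samples is set and grid_search is used in config, the whole grid is repeated num_samples times.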

# num_samples=10 repeats the 3x3 grid search 10 times, for a total of 90 trials
tune.run(
    my_trainable,
    name="my_trainable",
    config={
        "alpha": tune.uniform(0, 100),  # uniform needs both bounds; the lower bound 0 is assumed
        "beta": tune.sample_from(lambda spec: spec.config.alpha * np.random.normal()),
        "nn_layers": [
            tune.grid_search([16, 64, 256]),
            tune.grid_search([16, 64, 256]),
        ],
    },
    num_samples=10
)
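The trial count follows from simple arithmetic: the number of grid points multiplied by num_samples. The sketch below checks this in plain Python without Ray; expand_grid and total_trials are hypothetical helpers, and the layer_1/layer_2 names stand in for the two nn_layers entries.

```python
from itertools import product
from math import prod


def expand_grid(grid_axes):
    """Enumerate every combination of the grid_search axes."""
    names = list(grid_axes)
    return [dict(zip(names, combo)) for combo in product(*grid_axes.values())]


def total_trials(grid_axes, num_samples):
    """Each grid point is repeated num_samples times."""
    return prod(len(values) for values in grid_axes.values()) * num_samples


# The 3x3 grid from the example above.
layers = {"layer_1": [16, 64, 256], "layer_2": [16, 64, 256]}
print(len(expand_grid(layers)))              # 9
print(total_trials(layers, num_samples=10))  # 90

# The grid_search axes from the full script earlier in this post.
script_grid = {
    "batch_size": [16, 32, 64, 128],
    "weight_decay": [0.0001, 0.0003, 0.0005],
    "step": [[45, 55, 65], [30, 50], [20, 40, 60]],
    "momentum": [0.7, 0.8, 0.9],
    "warm_up_epoch": [0, 5, 10],
}
print(total_trials(script_grid, num_samples=200))  # 64800
```

By this arithmetic, combining num_samples=200 with five grid_search axes requests 64800 trials, so it is worth computing the product before launching a run.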