In our repository we provide a variety of examples covering different use cases and Tune features.
I. General examples:
1. An example with a custom logger and custom trial naming.
ray/python/ray/tune/examples/logging_example.py
Source:
#!/usr/bin/env python
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import json
import os
import random

import numpy as np

import ray
from ray import tune
from ray.tune import Trainable, run, Experiment


# Custom logger: prints every reported result to the console.
class TestLogger(tune.logger.Logger):
    def on_result(self, result):
        print("TestLogger", result)


# Trial naming function.
# trial.trainable_name is the trainable's name;
# trial.trial_id is the trial's ID.
def trial_str_creator(trial):
    return "{}_{}_123".format(trial.trainable_name, trial.trial_id)


class MyTrainableClass(Trainable):
    """Example agent whose learning curve is a random sigmoid.

    The dummy hyperparameters "width" and "height" determine the slope and
    maximum reward value reached.
    """

    # Initialization.
    def _setup(self, config):
        self.timestep = 0

    # Training function: each call is one training iteration.
    def _train(self):
        self.timestep += 1
        v = np.tanh(float(self.timestep) / self.config["width"])
        v *= self.config["height"]
        # Here we report `episode_reward_mean`, but other objectives can be
        # reported instead, e.g. mean_loss, mean_accuracy, or
        # timesteps_this_iter.
        return {"episode_reward_mean": v}

    # Save a checkpoint as a JSON file under checkpoint_dir.
    # The results location can also be customized: the local_dir argument of
    # tune.run() sets it, and by default it is '~/ray_results/<experiment
    # name>' (for this script, '~/ray_results/hyperband_test').
    def _save(self, checkpoint_dir):
        path = os.path.join(checkpoint_dir, "checkpoint")
        with open(path, "w") as f:
            f.write(json.dumps({"timestep": self.timestep}))
        return path

    # Restore from a given checkpoint (used when a trial recovers from an
    # error).
    def _restore(self, checkpoint_path):
        with open(checkpoint_path) as f:
            self.timestep = json.loads(f.read())["timestep"]


if __name__ == "__main__":
    parser = argparse.ArgumentParser()  # command-line argument parser
    parser.add_argument(
        "--smoke-test", action="store_true", help="Finish quickly for testing")
    args, _ = parser.parse_known_args()

    # Start Ray. On a Ray cluster, pass the head node's address and port,
    # e.g. redis_address="192.168.10.1:6379".
    ray.init()

    exp = Experiment(
        name="hyperband_test",  # experiment name
        run=MyTrainableClass,  # the trainable class to run
        # num_samples: sample the hyperparameter space twice, i.e. run
        # MyTrainableClass twice.
        num_samples=2,
        # rename the trials
        trial_name_creator=tune.function(trial_str_creator),
        loggers=[TestLogger],  # attach the custom logger
        # Trial termination condition (this is not an early-stopping rule).
        stop={"training_iteration": 1 if args.smoke_test else 999},
        # random.random() returns a random float in [0, 1).
        # config is the hyperparameter search space; any custom parameters a
        # trial needs can be passed through it.
        # Here width is sampled from [10, 100) and height from [0, 100).
        config={
            "width": tune.sample_from(
                lambda spec: 10 + int(90 * random.random())),
            "height": tune.sample_from(lambda spec: int(100 * random.random()))
        })
    trials = run(exp)
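As noted in the comments of _save(), the results directory can be redirected away from ~/ray_results. A minimal sketch, assuming local_dir is accepted by Experiment in this Tune version (the path and the fixed config values below are just illustrations):
exp = Experiment(
    name="hyperband_test",
    run=MyTrainableClass,
    local_dir="/tmp/tune_results",  # results land in /tmp/tune_results/hyperband_test
    num_samples=2,
    stop={"training_iteration": 999},
    config={"width": 50, "height": 50})  # fixed values instead of sampling
run(exp)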
Experiment output:
/usr/bin/python3.5 /home/kangkang/PycharmProjects/ray/python/ray/tune/examples/logging_example.py
2019-04-21 16:54:51,571 INFO node.py:423 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-21_16-54-51_20983/logs.
2019-04-21 16:54:51,674 INFO services.py:363 -- Waiting for redis server at 127.0.0.1:13169 to respond...
2019-04-21 16:54:51,800 INFO services.py:363 -- Waiting for redis server at 127.0.0.1:33321 to respond...
2019-04-21 16:54:51,801 INFO services.py:760 -- Starting Redis shard with 3.35 GB max memory.
2019-04-21 16:54:51,819 INFO services.py:1384 -- Starting the Plasma object store with 5.03 GB memory using /dev/shm.
2019-04-21 16:54:51,898 INFO tune.py:60 -- Tip: to resume incomplete experiments, pass resume='prompt' or resume=True to run()
2019-04-21 16:54:51,898 INFO tune.py:211 -- Starting a new experiment.
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/8 CPUs, 0/0 GPUs
Memory usage on this node: 3.9/16.8 GB
2019-04-21 16:54:52,556 WARNING util.py:62 -- The `start_trial` operation took 0.6490788459777832 seconds to complete, which may be a performance bottleneck.
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 1/8 CPUs, 0/0 GPUs
Memory usage on this node: 4.3/16.8 GB
Result logdir: /home/kangkang/ray_results/hyperband_test
Number of trials: 2 ({'RUNNING': 1, 'PENDING': 1})
PENDING trials:
- MyTrainableClass_245fefff_123: PENDING
RUNNING trials:
- MyTrainableClass_4969aace_123: RUNNING
Result for MyTrainableClass_245fefff_123:
date: 2019-04-21_16-54-53
done: false
episode_reward_mean: 0.19046719267795137
experiment_id: 7263a08cbc554d5a9256fef77c290c34
hostname: kangkang-1994
iterations_since_restore: 1
node_ip: 192.168.4.102
pid: 21022
time_since_restore: 5.173683166503906e-05
time_this_iter_s: 5.173683166503906e-05
time_total_s: 5.173683166503906e-05
timestamp: 1555836893
timesteps_since_restore: 0
training_iteration: 1
TestLogger {'node_ip': '192.168.4.102', 'timestamp': 1555836893, 'config': {'height': 16, 'width': 84}, 'experiment_id': '7263a08cbc554d5a9256fef77c290c34', 'training_iteration': 1, 'time_total_s': 5.173683166503906e-05, 'done': False, 'iterations_since_restore': 1, 'episodes_total': None, 'time_since_restore': 5.173683166503906e-05, 'pid': 21022, 'time_this_iter_s': 5.173683166503906e-05, 'hostname': 'kangkang-1994', 'timesteps_since_restore': 0, 'date': '2019-04-21_16-54-53', 'timesteps_total': None, 'episode_reward_mean': 0.19046719267795137}
Result for MyTrainableClass_4969aace_123:
date: 2019-04-21_16-54-53
done: false
episode_reward_mean: 1.7676692568367371
experiment_id: 512eb0621175451c84ea143d724f840f
hostname: kangkang-1994
iterations_since_restore: 1
node_ip: 192.168.4.102
pid: 21015
time_since_restore: 2.86102294921875e-05
time_this_iter_s: 2.86102294921875e-05
time_total_s: 2.86102294921875e-05
timestamp: 1555836893
timesteps_since_restore: 0
training_iteration: 1
TestLogger {'node_ip': '192.168.4.102', 'timestamp': 1555836893, 'experiment_id': '512eb0621175451c84ea143d724f840f', 'pid': 21015, 'date': '2019-04-21_16-54-53', 'time_total_s': 2.86102294921875e-05, 'config': {'height': 99, 'width': 56}, 'iterations_since_restore': 1, 'episode_reward_mean': 1.7676692568367371, 'episodes_total': None, 'time_since_restore': 2.86102294921875e-05, 'time_this_iter_s': 2.86102294921875e-05, 'hostname': 'kangkang-1994', 'timesteps_since_restore': 0, 'done': False, 'timesteps_total': None, 'training_iteration': 1}
TestLogger ............
.......
.......
TestLogger {'node_ip': '192.168.4.102', 'training_iteration': 998, 'hostname': 'kangkang-1994', 'timesteps_total': None, 'time_this_iter_s': 1.3589859008789062e-05, 'iterations_since_restore': 998, 'timestamp': 1555837061, 'done': False, 'date': '2019-04-21_16-57-41', 'pid': 21247, 'time_since_restore': 0.018785715103149414, 'experiment_id': '1c364c8bf6fa43c5b0f8aae7c231b098', 'time_total_s': 0.018785715103149414, 'config': {'height': 79, 'width': 53}, 'episode_reward_mean': 78.99999999999999, 'timesteps_since_restore': 0, 'episodes_total': None}
2019-04-21 16:57:41,062 INFO ray_trial_executor.py:178 -- Destroying actor for trial MyTrainableClass_2ef074ed_123. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
Result for MyTrainableClass_2ef074ed_123:
date: 2019-04-21_16-57-41
done: true
episode_reward_mean: 78.99999999999999
experiment_id: 1c364c8bf6fa43c5b0f8aae7c231b098
hostname: kangkang-1994
iterations_since_restore: 999
node_ip: 192.168.4.102
pid: 21247
time_since_restore: 0.018810272216796875
time_this_iter_s: 2.4557113647460938e-05
time_total_s: 0.018810272216796875
timestamp: 1555837061
timesteps_since_restore: 0
training_iteration: 999
TestLogger {'node_ip': '192.168.4.102', 'training_iteration': 999, 'hostname': 'kangkang-1994', 'timesteps_total': None, 'time_this_iter_s': 2.4557113647460938e-05, 'iterations_since_restore': 999, 'timestamp': 1555837061, 'done': True, 'date': '2019-04-21_16-57-41', 'pid': 21247, 'time_since_restore': 0.018810272216796875, 'experiment_id': '1c364c8bf6fa43c5b0f8aae7c231b098', 'time_total_s': 0.018810272216796875, 'config': {'height': 79, 'width': 53}, 'episode_reward_mean': 78.99999999999999, 'timesteps_since_restore': 0, 'episodes_total': None}
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/8 CPUs, 0/0 GPUs
Memory usage on this node: 4.5/16.8 GB
Result logdir: /home/kangkang/ray_results/hyperband_test
Number of trials: 2 ({'TERMINATED': 2})
TERMINATED trials:
- MyTrainableClass_fb55cbbf_123: TERMINATED, [1 CPUs, 0 GPUs], [pid=21243], 0 s, 999 iter, 51 rew
- MyTrainableClass_2ef074ed_123: TERMINATED, [1 CPUs, 0 GPUs], [pid=21247], 0 s, 999 iter, 79 rew
Process finished with exit code 0
In this example, logging works by defining a custom TestLogger (subclassing tune.logger.Logger), registering it on the experiment via the loggers=[TestLogger] setting, and thereby printing each trial's per-iteration status to the console, as shown below:
TestLogger {'node_ip': '192.168.4.102', 'timestamp': 1555836893, 'experiment_id': '512eb0621175451c84ea143d724f840f', 'pid': 21015, 'date': '2019-04-21_16-54-53', 'time_total_s': 2.86102294921875e-05, 'config': {'height': 99, 'width': 56}, 'iterations_since_restore': 1, 'episode_reward_mean': 1.7676692568367371, 'episodes_total': None, 'time_since_restore': 2.86102294921875e-05, 'time_this_iter_s': 2.86102294921875e-05, 'hostname': 'kangkang-1994', 'timesteps_since_restore': 0, 'done': False, 'timesteps_total': None, 'training_iteration': 1}
TestLogger ............
.......
.......
TestLogger {'node_ip': '192.168.4.102', 'training_iteration': 998, 'hostname': 'kangkang-1994', 'timesteps_total': None, 'time_this_iter_s': 1.3589859008789062e-05, 'iterations_since_restore': 998, 'timestamp': 1555837061, 'done': False, 'date': '2019-04-21_16-57-41', 'pid': 21247, 'time_since_restore': 0.018785715103149414, 'experiment_id': '1c364c8bf6fa43c5b0f8aae7c231b098', 'time_total_s': 0.018785715103149414, 'config': {'height': 79, 'width': 53}, 'episode_reward_mean': 78.99999999999999, 'timesteps_since_restore': 0, 'episodes_total': None}
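Beyond printing, the same hook can persist results. Below is a minimal sketch of a file-based logger, assuming the same Logger interface used by TestLogger above (an on_result callback plus _init/close hooks and a per-trial self.logdir attribute); the class name and file name are hypothetical:
import json
import os

from ray import tune


class JsonLinesLogger(tune.logger.Logger):
    def _init(self):
        # self.logdir is the trial's log directory, provided by the base class.
        self._file = open(os.path.join(self.logdir, "results.jsonl"), "a")

    def on_result(self, result):
        # Append one JSON object per reported result.
        self._file.write(json.dumps(result, default=str) + "\n")
        self._file.flush()

    def close(self):
        self._file.close()
It would be attached the same way as TestLogger, e.g. loggers=[TestLogger, JsonLinesLogger].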
Custom trial naming:
The naming function trial_str_creator(trial) turns each trial into a string name, and setting the experiment parameter trial_name_creator=tune.function(trial_str_creator) applies it. The trial_str_creator function is user-defined; in this code it names each trial {trainable name}_{trial ID}_123.
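For instance, instead of the fixed _123 suffix, the sampled hyperparameters could be baked into the name. A hypothetical variant, assuming trial.config holds the sampled values:
def trial_str_creator(trial):
    # e.g. "MyTrainableClass_w84_h16"
    return "{}_w{}_h{}".format(
        trial.trainable_name,
        trial.config.get("width"),
        trial.config.get("height"))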
2. An example of a Trainable class used with AsyncHyperBandScheduler.
ray/python/ray/tune/examples/async_hyperband_example.py
Source:
#!/usr/bin/env python
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import json
import os
import random

import numpy as np

import ray
from ray.tune import Trainable, run, sample_from
from ray.tune.schedulers import AsyncHyperBandScheduler


class MyTrainableClass(Trainable):
    """Example agent whose learning curve is a random sigmoid.

    The dummy hyperparameters "width" and "height" determine the slope and
    maximum reward value reached.
    """

    def _setup(self, config):
        self.timestep = 0

    def _train(self):
        self.timestep += 1
        v = np.tanh(float(self.timestep) / self.config["width"])
        v *= self.config["height"]
        # Here we use `episode_reward_mean`, but you can also report other
        # objectives such as loss or accuracy.
        return {"episode_reward_mean": v}

    def _save(self, checkpoint_dir):
        path = os.path.join(checkpoint_dir, "checkpoint")
        with open(path, "w") as f:
            f.write(json.dumps({"timestep": self.timestep}))
        return path

    def _restore(self, checkpoint_path):
        with open(checkpoint_path) as f:
            self.timestep = json.loads(f.read())["timestep"]


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--smoke-test", action="store_true", help="Finish quickly for testing")
    args, _ = parser.parse_known_args()
    ray.init()

    # Asynchronous HyperBand early stopping, configured with
    # `episode_reward_mean` as the objective and `training_iteration` as the
    # time unit (automatically filled in by Tune).
    # time_attr: the time unit; reward_attr: the objective;
    # grace_period: minimum time (iterations) before a trial may be stopped;
    # max_t: maximum time (iterations) per trial.
    # The scheduler here is AsyncHyperBandScheduler; the default is FIFO.
    ahb = AsyncHyperBandScheduler(
        time_attr="training_iteration",
        reward_attr="episode_reward_mean",
        grace_period=5,
        max_t=100)

    run(MyTrainableClass,
        name="asynchyperband_test",
        scheduler=ahb,  # attach the scheduler
        **{
            # stopping condition: training_iteration reaches 99999
            "stop": {
                "training_iteration": 1 if args.smoke_test else 99999
            },
            # Draw 30 samples, i.e. run MyTrainableClass 30 times under the
            # AsyncHyperBandScheduler.
            "num_samples": 30,
            # Resources per trial: 1 CPU and 0 GPUs. The available CPUs
            # default to the number of cores on this machine.
            "resources_per_trial": {
                "cpu": 1,
                "gpu": 0
            },
            "config": {
                "width": sample_from(
                    lambda spec: 10 + int(90 * random.random())),
                "height": sample_from(lambda spec: int(100 * random.random())),
            },
        })
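Note that the **{...} in the call above is plain dict unpacking; the following direct keyword form is equivalent:
run(MyTrainableClass,
    name="asynchyperband_test",
    scheduler=ahb,
    stop={"training_iteration": 1 if args.smoke_test else 99999},
    num_samples=30,
    resources_per_trial={"cpu": 1, "gpu": 0},
    config={
        "width": sample_from(lambda spec: 10 + int(90 * random.random())),
        "height": sample_from(lambda spec: int(100 * random.random())),
    })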
Output:
/usr/bin/python3.5 /home/kangkang/PycharmProjects/ray/python/ray/tune/examples/async_hyperband_example.py
2019-04-21 21:59:12,417 INFO node.py:423 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-21_21-59-12_19182/logs.
2019-04-21 21:59:12,520 INFO services.py:363 -- Waiting for redis server at 127.0.0.1:32072 to respond...
2019-04-21 21:59:12,640 INFO services.py:363 -- Waiting for redis server at 127.0.0.1:47740 to respond...
2019-04-21 21:59:12,643 INFO services.py:760 -- Starting Redis shard with 3.35 GB max memory.
2019-04-21 21:59:12,662 INFO services.py:1384 -- Starting the Plasma object store with 5.03 GB memory using /dev/shm.
2019-04-21 21:59:12,766 INFO tune.py:60 -- Tip: to resume incomplete experiments, pass resume='prompt' or resume=True to run()
2019-04-21 21:59:12,766 INFO tune.py:211 -- Starting a new experiment.
== Status ==
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 45.000: None | Iter 15.000: None | Iter 5.000: None
Bracket: Iter 45.000: None | Iter 15.000: None
Bracket: Iter 45.000: None
Resources requested: 0/8 CPUs, 0/0 GPUs
Memory usage on this node: 4.3/16.8 GB
2019-04-21 21:59:14,132 WARNING util.py:62 -- The `start_trial` operation took 1.1828277111053467 seconds to complete, which may be a performance bottleneck.
== Status ==
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 45.000: None | Iter 15.000: None | Iter 5.000: None
Bracket: Iter 45.000: None | Iter 15.000: None
Bracket: Iter 45.000: None
Resources requested: 1/8 CPUs, 0/0 GPUs
Memory usage on this node: 4.8/16.8 GB
Result logdir: /home/kangkang/ray_results/asynchyperband_test
Number of trials: 30 ({'RUNNING': 1, 'PENDING': 29})
PENDING trials:
- MyTrainableClass_1_height=7,width=57: PENDING
- MyTrainableClass_2_height=77,width=25: PENDING
- MyTrainableClass_3_height=24,width=71: PENDING
- MyTrainableClass_4_height=4,width=19: PENDING
- MyTrainableClass_5_height=98,width=55: PENDING
- MyTrainableClass_6_height=36,width=99: PENDING
- MyTrainableClass_7_height=57,width=44: PENDING
- MyTrainableClass_8_height=94,width=99: PENDING
- MyTrainableClass_9_height=76,width=80: PENDING
... 11 not shown
- MyTrainableClass_21_height=7,width=76: PENDING
- MyTrainableClass_22_height=97,width=74: PENDING
- MyTrainableClass_23_height=14,width=53: PENDING
- MyTrainableClass_24_height=6,width=17: PENDING
- MyTrainableClass_25_height=69,width=21: PENDING
- MyTrainableClass_26_height=83,width=55: PENDING
- MyTrainableClass_27_height=19,width=78: PENDING
- MyTrainableClass_28_height=34,width=85: PENDING
- MyTrainableClass_29_height=35,width=57: PENDING
RUNNING trials:
- MyTrainableClass_0_height=54,width=90: RUNNING
Result for MyTrainableClass_0_height=54,width=90:
date: 2019-04-21_21-59-14
done: false
episode_reward_mean: 0.5999753098612407
experiment_id: dbb12a9d42ec4107bceca3feb43d783f
hostname: kangkang-1994
iterations_since_restore: 1
node_ip: 192.168.4.102
pid: 19219
time_since_restore: 3.0040740966796875e-05
time_this_iter_s: 3.0040740966796875e-05
time_total_s: 3.0040740966796875e-05
timestamp: 1555855154
timesteps_since_restore: 0
training_iteration: 1
Result for MyTrainableClass_1_height=7,width=57:
......
......
......
2019-04-21 21:59:22,944 INFO ray_trial_executor.py:178 -- Destroying actor for trial MyTrainableClass_26_height=83,width=55. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
Result for MyTrainableClass_25_height=69,width=21:
date: 2019-04-21_21-59-23
done: true
episode_reward_mean: 68.98991422160057
experiment_id: 2554df5c37ed477fab6390a2f3a9aec5
hostname: kangkang-1994
iterations_since_restore: 100
node_ip: 192.168.4.102
pid: 19605
time_since_restore: 0.0022301673889160156
time_this_iter_s: 1.239776611328125e-05
time_total_s: 0.0022301673889160156
timestamp: 1555855163
timesteps_since_restore: 0
training_iteration: 100
2019-04-21 21:59:23,202 INFO ray_trial_executor.py:178 -- Destroying actor for trial MyTrainableClass_25_height=69,width=21. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
== Status ==
Using AsyncHyperBand: num_stopped=30
Bracket: Iter 45.000: 42.57614813705181 | Iter 15.000: 24.071527675803438 | Iter 5.000: 4.667534561014101
Bracket: Iter 45.000: 66.05974774234495 | Iter 15.000: 19.528068736719277
Bracket: Iter 45.000: 3.9304791553474514
Resources requested: 0/8 CPUs, 0/0 GPUs
Memory usage on this node: 4.7/16.8 GB
Result logdir: /home/kangkang/ray_results/asynchyperband_test
Number of trials: 30 ({'TERMINATED': 30})
TERMINATED trials:
- MyTrainableClass_0_height=54,width=90: TERMINATED, [1 CPUs, 0 GPUs], [pid=19219], 0 s, 100 iter, 43.4 rew
- MyTrainableClass_1_height=7,width=57: TERMINATED, [1 CPUs, 0 GPUs], [pid=19218], 0 s, 5 iter, 0.612 rew
- MyTrainableClass_2_height=77,width=25: TERMINATED, [1 CPUs, 0 GPUs], [pid=19221], 0 s, 100 iter, 76.9 rew
- MyTrainableClass_3_height=24,width=71: TERMINATED, [1 CPUs, 0 GPUs], [pid=19217], 0 s, 5 iter, 1.69 rew
- MyTrainableClass_4_height=4,width=19: TERMINATED, [1 CPUs, 0 GPUs], [pid=19215], 0 s, 100 iter, 4 rew
- MyTrainableClass_5_height=98,width=55: TERMINATED, [1 CPUs, 0 GPUs], [pid=19216], 0 s, 100 iter, 93 rew
- MyTrainableClass_6_height=36,width=99: TERMINATED, [1 CPUs, 0 GPUs], [pid=19220], 0 s, 5 iter, 1.82 rew
- MyTrainableClass_7_height=57,width=44: TERMINATED, [1 CPUs, 0 GPUs], [pid=19222], 0 s, 15 iter, 18.7 rew
- MyTrainableClass_8_height=94,width=99: TERMINATED, [1 CPUs, 0 GPUs], [pid=19365], 0 s, 100 iter, 72 rew
- MyTrainableClass_9_height=76,width=80: TERMINATED, [1 CPUs, 0 GPUs], [pid=19362], 0 s, 100 iter, 64.5 rew
- MyTrainableClass_10_height=80,width=31: TERMINATED, [1 CPUs, 0 GPUs], [pid=19364], 0 s, 100 iter, 79.7 rew
- MyTrainableClass_11_height=24,width=27: TERMINATED, [1 CPUs, 0 GPUs], [pid=19366], 0 s, 15 iter, 12.1 rew
- MyTrainableClass_12_height=58,width=48: TERMINATED, [1 CPUs, 0 GPUs], [pid=19435], 0 s, 100 iter, 56.2 rew
- MyTrainableClass_13_height=14,width=49: TERMINATED, [1 CPUs, 0 GPUs], [pid=19430], 0 s, 15 iter, 4.16 rew
- MyTrainableClass_14_height=8,width=78: TERMINATED, [1 CPUs, 0 GPUs], [pid=19374], 0 s, 15 iter, 1.52 rew
- MyTrainableClass_15_height=80,width=60: TERMINATED, [1 CPUs, 0 GPUs], [pid=19360], 0 s, 15 iter, 19.6 rew
- MyTrainableClass_16_height=33,width=55: TERMINATED, [1 CPUs, 0 GP