In our repository we provide a variety of examples covering different use cases and Tune features.
I. General examples:
1. An example with a custom logger and custom trial naming.
ray/python/ray/tune/examples/logging_example.py
Source:
#!/usr/bin/env python
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import json
import os
import random

import numpy as np

import ray
from ray import tune
from ray.tune import Trainable, run, Experiment


# Custom logger: prints every reported result to the console.
class TestLogger(tune.logger.Logger):
    def on_result(self, result):
        print("TestLogger", result)


# Trial naming function.
# trial.trainable_name is the trainable's name;
# trial.trial_id is the trial's ID.
def trial_str_creator(trial):
    return "{}_{}_123".format(trial.trainable_name, trial.trial_id)


class MyTrainableClass(Trainable):
    """Example agent whose learning curve is a random sigmoid.

    The dummy hyperparameters "width" and "height" determine the slope and
    maximum reward value reached.
    """

    # Initialization.
    def _setup(self, config):
        self.timestep = 0

    # Training function: each call is one training iteration.
    def _train(self):
        self.timestep += 1
        v = np.tanh(float(self.timestep) / self.config["width"])
        v *= self.config["height"]
        # Here we report `episode_reward_mean`, but other objectives can be
        # reported instead, e.g. mean_loss, mean_accuracy, or
        # timesteps_this_iter.
        return {"episode_reward_mean": v}

    # Save a checkpoint as a JSON file under checkpoint_dir.
    # The results location can also be customized: the local_dir argument of
    # tune.run() sets it, and by default it is '~/ray_results/<experiment
    # name>' (for this script, '~/ray_results/hyperband_test').
    def _save(self, checkpoint_dir):
        path = os.path.join(checkpoint_dir, "checkpoint")
        with open(path, "w") as f:
            f.write(json.dumps({"timestep": self.timestep}))
        return path

    # Restore from a given checkpoint (used when a trial recovers from an
    # error).
    def _restore(self, checkpoint_path):
        with open(checkpoint_path) as f:
            self.timestep = json.loads(f.read())["timestep"]


if __name__ == "__main__":
    parser = argparse.ArgumentParser()  # command-line argument parser
    parser.add_argument(
        "--smoke-test", action="store_true", help="Finish quickly for testing")
    args, _ = parser.parse_known_args()

    # Start Ray. On a Ray cluster, pass the head node's address and port,
    # e.g. redis_address="192.168.10.1:6379".
    ray.init()

    exp = Experiment(
        name="hyperband_test",  # experiment name
        run=MyTrainableClass,  # the trainable class to run
        # num_samples: sample the hyperparameter space twice, i.e. run
        # MyTrainableClass twice.
        num_samples=2,
        # rename the trials
        trial_name_creator=tune.function(trial_str_creator),
        loggers=[TestLogger],  # attach the custom logger
        # Trial termination condition (this is not an early-stopping rule).
        stop={"training_iteration": 1 if args.smoke_test else 999},
        # random.random() returns a random float in [0, 1).
        # config is the hyperparameter search space; any custom parameters a
        # trial needs can be passed through it.
        # Here width is sampled from [10, 100) and height from [0, 100).
        config={
            "width": tune.sample_from(
                lambda spec: 10 + int(90 * random.random())),
            "height": tune.sample_from(lambda spec: int(100 * random.random()))
        })
    trials = run(exp)
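As noted in the comments of _save(), the results directory can be redirected away from ~/ray_results. A minimal sketch, assuming local_dir is accepted by Experiment in this Tune version (the path and the fixed config values below are just illustrations):
exp = Experiment(
    name="hyperband_test",
    run=MyTrainableClass,
    local_dir="/tmp/tune_results",  # results land in /tmp/tune_results/hyperband_test
    num_samples=2,
    stop={"training_iteration": 999},
    config={"width": 50, "height": 50})  # fixed values instead of sampling
run(exp)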
Experiment output:
/usr/bin/python3.5 /home/kangkang/PycharmProjects/ray/python/ray/tune/examples/logging_example.py
2019-04-21 16:54:51,571 INFO node.py:423 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-21_16-54-51_20983/logs.
2019-04-21 16:54:51,674 INFO services.py:363 -- Waiting for redis server at 127.0.0.1:13169 to respond...
2019-04-21 16:54:51,800 INFO services.py:363 -- Waiting for redis server at 127.0.0.1:33321 to respond...
2019-04-21 16:54:51,801 INFO services.py:760 -- Starting Redis shard with 3.35 GB max memory.
2019-04-21 16:54:51,819 INFO services.py:1384 -- Starting the Plasma object store with 5.03 GB memory using /dev/shm.
2019-04-21 16:54:51,898 INFO tune.py:60 -- Tip: to resume incomplete experiments, pass resume='prompt' or resume=True to run()
2019-04-21 16:54:51,898 INFO tune.py:211 -- Starting a new experiment.
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/8 CPUs, 0/0 GPUs
Memory usage on this node: 3.9/16.8 GB
2019-04-21 16:54:52,556 WARNING util.py:62 -- The `start_trial` operation took 0.6490788459777832 seconds to complete, which may be a performance bottleneck.
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 1/8 CPUs, 0/0 GPUs
Memory usage on this node: 4.3/16.8 GB
Result logdir: /home/kangkang/ray_results/hyperband_test
Number of trials: 2 ({'RUNNING': 1, 'PENDING': 1})
PENDING trials:
- MyTrainableClass_245fefff_123: PENDING
RUNNING trials:
- MyTrainableClass_4969aace_123: RUNNING
Result for MyTrainableClass_245fefff_123:
date: 2019-04-21_16-54-53
done: false
episode_reward_mean: 0.19046719267795137
experiment_id: 7263a08cbc554d5a9256fef77c290c34
hostname: kangkang-1994
iterations_since_restore: 1
node_ip: 192.168.4.102
pid: 21022
time_since_restore: 5.173683166503906e-05
time_this_iter_s: 5.173683166503906e-05
time_total_s: 5.173683166503906e-05
timestamp: 1555836893
timesteps_since_restore: 0
training_iteration: 1
TestLogger {'node_ip': '192.168.4.102', 'timestamp': 1555836893, 'config': {'height': 16, 'width': 84}, 'experiment_id': '7263a08cbc554d5a9256fef77c290c34', 'training_iteration': 1, 'time_total_s': 5.173683166503906e-05, 'done': False, 'iterations_since_restore': 1, 'episodes_total': None, 'time_since_restore': 5.173683166503906e-05, 'pid': 21022, 'time_this_iter_s': 5.173683166503906e-05, 'hostname': 'kangkang-1994', 'timesteps_since_restore': 0, 'date': '2019-04-21_16-54-53', 'timesteps_total': None, 'episode_reward_mean': 0.19046719267795137}
Result for MyTrainableClass_4969aace_123:
date: 2019-04-21_16-54-53
done: false
episode_reward_mean: 1.7676692568367371
experiment_id: 512eb0621175451c84ea143d724f840f
hostname: kangkang-1994
iterations_since_restore: 1
node_ip: 192.168.4.102
pid: 21015
time_since_restore: 2.86102294921875e-05
time_this_iter_s: 2.86102294921875e-05
time_total_s: 2.86102294921875e-05
timestamp: 1555836893
timesteps_since_restore: 0
training_iteration: 1
TestLogger {'node_ip': '192.168.4.102', 'timestamp': 1555836893, 'experiment_id': '512eb0621175451c84ea143d724f840f', 'pid': 21015, 'date': '2019-04-21_16-54-53', 'time_total_s': 2.86102294921875e-05, 'config': {'height': 99, 'width': 56}, 'iterations_since_restore': 1, 'episode_reward_mean': 1.7676692568367371, 'episodes_total': None, 'time_since_restore': 2.86102294921875e-05, 'time_this_iter_s': 2.86102294921875e-05, 'hostname': 'kangkang-1994', 'timesteps_since_restore': 0, 'done': False, 'timesteps_total': None, 'training_iteration': 1}
TestLogger ............
.......
.......
TestLogger {'node_ip': '192.168.4.102', 'training_iteration': 998, 'hostname': 'kangkang-1994', 'timesteps_total': None, 'time_this_iter_s': 1.3589859008789062e-05, 'iterations_since_restore': 998, 'timestamp': 1555837061, 'done': False, 'date': '2019-04-21_16-57-41', 'pid': 21247, 'time_since_restore': 0.018785715103149414, 'experiment_id': '1c364c8bf6fa43c5b0f8aae7c231b098', 'time_total_s': 0.018785715103149414, 'config': {'height': 79, 'width': 53}, 'episode_reward_mean': 78.99999999999999, 'timesteps_since_restore': 0, 'episodes_total': None}
2019-04-21 16:57:41,062 INFO ray_trial_executor.py:178 -- Destroying actor for trial MyTrainableClass_2ef074ed_123. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
Result for MyTrainableClass_2ef074ed_123:
date: 2019-04-21_16-57-41
done: true
episode_reward_mean: 78.99999999999999
experiment_id: 1c364c8bf6fa43c5b0f8aae7c231b098
hostname: kangkang-1994
iterations_since_restore: 999
node_ip: 192.168.4.102
pid: 21247
time_since_restore: 0.018810272216796875
time_this_iter_s: 2.4557113647460938e-05
time_total_s: 0.018810272216796875
timestamp: 1555837061
timesteps_since_restore: 0
training_iteration: 999
TestLogger {'node_ip': '192.168.4.102', 'training_iteration': 999, 'hostname': 'kangkang-1994', 'timesteps_total': None, 'time_this_iter_s': 2.4557113647460938e-05, 'iterations_since_restore': 999, 'timestamp': 1555837061, 'done': True, 'date': '2019-04-21_16-57-41', 'pid': 21247, 'time_since_restore': 0.018810272216796875, 'experiment_id': '1c364c8bf6fa43c5b0f8aae7c231b098', 'time_total_s': 0.018810272216796875, 'config': {'height': 79, 'width': 53}, 'episode_reward_mean': 78.99999999999999, 'timesteps_since_restore': 0, 'episodes_total': None}
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/8 CPUs, 0/0 GPUs
Memory usage on this node: 4.5/16.8 GB
Result logdir: /home/kangkang/ray_results/hyperband_test
Number of trials: 2 ({'TERMINATED': 2})
TERMINATED trials:
- MyTrainableClass_fb55cbbf_123: TERMINATED, [1 CPUs, 0 GPUs], [pid=21243], 0 s, 999 iter, 51 rew
- MyTrainableClass_2ef074ed_123: TERMINATED, [1 CPUs, 0 GPUs], [pid=21247], 0 s, 999 iter, 79 rew
Process finished with exit code 0
In this example, logging works by defining a custom TestLogger (subclassing tune.logger.Logger), registering it on the experiment via the loggers=[TestLogger] setting, and thereby printing each trial's per-iteration status to the console, as shown below:
TestLogger {'node_ip': '192.168.4.102', 'timestamp': 1555836893, 'experiment_id': '512eb0621175451c84ea143d724f840f', 'pid': 21015, 'date': '2019-04-21_16-54-53', 'time_total_s': 2.86102294921875e-05, 'config': {'height': 99, 'width': 56}, 'iterations_since_restore': 1, 'episode_reward_mean': 1.7676692568367371, 'episodes_total': None, 'time_since_restore': 2.86102294921875e-05, 'time_this_iter_s': 2.86102294921875e-05, 'hostname': 'kangkang-1994', 'timesteps_since_restore': 0, 'done': False, 'timesteps_total': None, 'training_iteration': 1}
TestLogger ............
.......
.......
TestLogger {'node_ip': '192.168.4.102', 'training_iteration': 998, 'hostname': 'kangkang-1994', 'timesteps_total': None, 'time_this_iter_s': 1.3589859008789062e-05, 'iterations_since_restore': 998, 'timestamp': 1555837061, 'done': False, 'date': '2019-04-21_16-57-41', 'pid': 21247, 'time_since_restore': 0.018785715103149414, 'experiment_id': '1c364c8bf6fa43c5b0f8aae7c231b098', 'time_total_s': 0.018785715103149414, 'config': {'height': 79, 'width': 53}, 'episode_reward_mean': 78.99999999999999, 'timesteps_since_restore': 0, 'episodes_total': None}
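Beyond printing, the same hook can persist results. Below is a minimal sketch of a file-based logger, assuming the same Logger interface used by TestLogger above (an on_result callback plus _init/close hooks and a per-trial self.logdir attribute); the class name and file name are hypothetical:
import json
import os

from ray import tune


class JsonLinesLogger(tune.logger.Logger):
    def _init(self):
        # self.logdir is the trial's log directory, provided by the base class.
        self._file = open(os.path.join(self.logdir, "results.jsonl"), "a")

    def on_result(self, result):
        # Append one JSON object per reported result.
        self._file.write(json.dumps(result, default=str) + "\n")
        self._file.flush()

    def close(self):
        self._file.close()
It would be attached the same way as TestLogger, e.g. loggers=[TestLogger, JsonLinesLogger].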
Custom trial naming:
The naming function trial_str_creator(trial) turns each trial into a string name, and setting the experiment parameter trial_name_creator=tune.function(trial_str_creator) applies it. The trial_str_creator function is user-defined; in this code it names each trial {trainable name}_{trial ID}_123.
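For instance, instead of the fixed _123 suffix, the sampled hyperparameters could be baked into the name. A hypothetical variant, assuming trial.config holds the sampled values:
def trial_str_creator(trial):
    # e.g. "MyTrainableClass_w84_h16"
    return "{}_w{}_h{}".format(
        trial.trainable_name,
        trial.config.get("width"),
        trial.config.get("height"))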
2. An example of a Trainable class used with AsyncHyperBandScheduler.
ray/python/ray/tune/examples/async_hyperband_example.py
Source:
#!/usr/bin/env python
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import json
import os
import random

import numpy as np

import ray
from ray.tune import Trainable, run, sample_from
from ray.tune.schedulers import AsyncHyperBandScheduler


class MyTrainableClass(Trainable):
    """Example agent whose learning curve is a random sigmoid.

    The dummy hyperparameters "width" and "height" determine the slope and
    maximum reward value reached.
    """

    def _setup(self, config):
        self.timestep = 0

    def _train(self):
        self.timestep += 1
        v = np.tanh(float(self.timestep) / self.config["width"])
        v *= self.config["height"]
        # Here we use `episode_reward_mean`, but you can also report other
        # objectives such as loss or accuracy.
        return {"episode_reward_mean": v}

    def _save(self, checkpoint_dir):
        path = os.path.join(checkpoint_dir, "checkpoint")
        with open(path, "w") as f:
            f.write(json.dumps({"timestep": self.timestep}))
        return path

    def _restore(self, checkpoint_path):
        with open(checkpoint_path) as f:
            self.timestep = json.loads(f.read())["timestep"]


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--smoke-test", action="store_true", help="Finish quickly for testing")
    args, _ = parser.parse_known_args()
    ray.init()

    # Asynchronous HyperBand early stopping, configured with
    # `episode_reward_mean` as the objective and `training_iteration` as the
    # time unit (automatically filled in by Tune).
    # time_attr: the time unit; reward_attr: the objective;
    # grace_period: minimum time (iterations) before a trial may be stopped;
    # max_t: maximum time (iterations) per trial.
    # The scheduler here is AsyncHyperBandScheduler; the default is FIFO.
    ahb = AsyncHyperBandScheduler(
        time_attr="training_iteration",
        reward_attr="episode_reward_mean",
        grace_period=5,
        max_t=100)

    run(MyTrainableClass,
        name="asynchyperband_test",
        scheduler=ahb,  # attach the scheduler
        **{
            # stopping condition: training_iteration reaches 99999
            "stop": {
                "training_iteration": 1 if args.smoke_test else 99999
            },
            # Draw 30 samples, i.e. run MyTrainableClass 30 times under the
            # AsyncHyperBandScheduler.
            "num_samples": 30,
            # Resources per trial: 1 CPU and 0 GPUs. The available CPUs
            # default to the number of cores on this machine.
            "resources_per_trial": {
                "cpu": 1,
                "gpu": 0
            },
            "config": {
                "width": sample_from(
                    lambda spec: 10 + int(90 * random.random())),
                "height": sample_from(lambda spec: int(100 * random.random())),
            },
        })
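Note that the **{...} in the call above is plain dict unpacking; the following direct keyword form is equivalent:
run(MyTrainableClass,
    name="asynchyperband_test",
    scheduler=ahb,
    stop={"training_iteration": 1 if args.smoke_test else 99999},
    num_samples=30,
    resources_per_trial={"cpu": 1, "gpu": 0},
    config={
        "width": sample_from(lambda spec: 10 + int(90 * random.random())),
        "height": sample_from(lambda spec: int(100 * random.random())),
    })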
Output:
/usr/bin/python3.5 /home/kangkang/PycharmProjects/ray/python/ray/tune/examples/async_hyperband_example.py
2019-04-21 21:59:12,417 INFO node.py:423 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-21_21-59-12_19182/logs.
2019-04-21 21:59:12,520 INFO services.py:363 -- Waiting for redis server at 127.0.0.1:32072 to respond...
2019-04-21 21:59:12,640 INFO services.py:363 -- Waiting for redis server at 127.0.0.1:47740 to respond...
2019-04-21 21:59:12,643 INFO services.py:760 -- Starting Redis shard with 3.35 GB max memory.
2019-04-21 21:59:12,662 INFO services.py:1384 -- Starting the Plasma object store with 5.03 GB memory using /dev/shm.
2019-04-21 21:59:12,766 INFO tune.py:60 -- Tip: to resume incomplete experiments, pass resume='prompt' or resume=True to run()
2019-04-21 21:59:12,766 INFO tune.py:211 -- Starting a new experiment.
== Status ==
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 45.000: None | Iter 15.000: None | Iter 5.000: None
Bracket: Iter 45.000: None | Iter 15.000: None
Bracket: Iter 45.000: None
Resources requested: 0/8 CPUs, 0/0 GPUs
Memory usage on this node: 4.3/16.8 GB
2019-04-21 21:59:14,132 WARNING util.py:62 -- The `start_trial` operation took 1.1828277111053467 seconds to complete, which may be a performance bottleneck.
== Status ==
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 45.000: None | Iter 15.000: None | Iter 5.000: None
Bracket: Iter 45.000: None | Iter 15.000: None
Bracket: Iter 45.000: None
Resources requested: 1/8 CPUs, 0/0 GPUs
Memory usage on this node: 4.8/16.8 GB
Result logdir: /home/kangkang/ray_results/asynchyperband_test
Number of trials: 30 ({'RUNNING': 1, 'PENDING': 29})
PENDING trials:
- MyTrainableClass_1_height=7,width=57: PENDING
- MyTrainableClass_2_height=77,width=25: PENDING
- MyTrainableClass_3_height=24,width=71: PENDING
- MyTrainableClass_4_height=4,width=19: PENDING
- MyTrainableClass_5_height=98,width=55: PENDING
- MyTrainableClass_6_height=36,width=99: PENDING
- MyTrainableClass_7_height=57,width=44: PENDING
- MyTrainableClass_8_height=94,width=99: PENDING
- MyTrainableClass_9_height=76,width=80: PENDING
... 11 not shown
- MyTrainableClass_21_height=7,width=76: PENDING
- MyTrainableClass_22_height=97,width=74: PENDING
- MyTrainableClass_23_height=14,width=53: PENDING
- MyTrainableClass_24_height=6,width=17: PENDING
- MyTrainableClass_25_height=69,width=21: PENDING
- MyTrainableClass_26_height=83,width=55: PENDING
- MyTrainableClass_27_height=19,width=78: PENDING
- MyTrainableClass_28_height=34,width=85: PENDING
- MyTrainableClass_29_height=35,width=57: PENDING
RUNNING trials:
- MyTrainableClass_0_height=54,width=90: RUNNING
Result for MyTrainableClass_0_height=54,width=90:
date: 2019-04-21_21-59-14
done: false
episode_reward_mean: 0.5999753098612407
experiment_id: dbb12a9d42ec4107bceca3feb43d783f
hostname: kangkang-1994
iterations_since_restore: 1
node_ip: 192.168.4.102
pid: 19219
time_since_restore: 3.0040740966796875e-05
time_this_iter_s: 3.0040740966796875e-05
time_total_s: 3.0040740966796875e-05
timestamp: 1555855154
timesteps_since_restore: 0
training_iteration: 1
Result for MyTrainableClass_1_height=7,width=57:
......
......
......
2019-04-21 21:59:22,944 INFO ray_trial_executor.py:178 -- Destroying actor for trial MyTrainableClass_26_height=83,width=55. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
Result for MyTrainableClass_25_height=69,width=21:
date: 2019-04-21_21-59-23
done: true
episode_reward_mean: 68.98991422160057
experiment_id: 2554df5c37ed477fab6390a2f3a9aec5
hostname: kangkang-1994
iterations_since_restore: 100
node_ip: 192.168.4.102
pid: 19605
time_since_restore: 0.0022301673889160156
time_this_iter_s: 1.239776611328125e-05
time_total_s: 0.0022301673889160156
timestamp: 1555855163
timesteps_since_restore: 0
training_iteration: 100
2019-04-21 21:59:23,202 INFO ray_trial_executor.py:178 -- Destroying actor for trial MyTrainableClass_25_height=69,width=21. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
== Status ==
Using AsyncHyperBand: num_stopped=30
Bracket: Iter 45.000: 42.57614813705181 | Iter 15.000: 24.071527675803438 | Iter 5.000: 4.667534561014101
Bracket: Iter 45.000: 66.05974774234495 | Iter 15.000: 19.528068736719277
Bracket: Iter 45.000: 3.9304791553474514
Resources requested: 0/8 CPUs, 0/0 GPUs
Memory usage on this node: 4.7/16.8 GB
Result logdir: /home/kangkang/ray_results/asynchyperband_test
Number of trials: 30 ({'TERMINATED': 30})
TERMINATED trials:
- MyTrainableClass_0_height=54,width=90: TERMINATED, [1 CPUs, 0 GPUs], [pid=19219], 0 s, 100 iter, 43.4 rew
- MyTrainableClass_1_height=7,width=57: TERMINATED, [1 CPUs, 0 GPUs], [pid=19218], 0 s, 5 iter, 0.612 rew
- MyTrainableClass_2_height=77,width=25: TERMINATED, [1 CPUs, 0 GPUs], [pid=19221], 0 s, 100 iter, 76.9 rew
- MyTrainableClass_3_height=24,width=71: TERMINATED, [1 CPUs, 0 GPUs], [pid=19217], 0 s, 5 iter, 1.69 rew
- MyTrainableClass_4_height=4,width=19: TERMINATED, [1 CPUs, 0 GPUs], [pid=19215], 0 s, 100 iter, 4 rew
- MyTrainableClass_5_height=98,width=55: TERMINATED, [1 CPUs, 0 GPUs], [pid=19216], 0 s, 100 iter, 93 rew
- MyTrainableClass_6_height=36,width=99: TERMINATED, [1 CPUs, 0 GPUs], [pid=19220], 0 s, 5 iter, 1.82 rew
- MyTrainableClass_7_height=57,width=44: TERMINATED, [1 CPUs, 0 GPUs], [pid=19222], 0 s, 15 iter, 18.7 rew
- MyTrainableClass_8_height=94,width=99: TERMINATED, [1 CPUs, 0 GPUs], [pid=19365], 0 s, 100 iter, 72 rew
- MyTrainableClass_9_height=76,width=80: TERMINATED, [1 CPUs, 0 GPUs], [pid=19362], 0 s, 100 iter, 64.5 rew
- MyTrainableClass_10_height=80,width=31: TERMINATED, [1 CPUs, 0 GPUs], [pid=19364], 0 s, 100 iter, 79.7 rew
- MyTrainableClass_11_height=24,width=27: TERMINATED, [1 CPUs, 0 GPUs], [pid=19366], 0 s, 15 iter, 12.1 rew
- MyTrainableClass_12_height=58,width=48: TERMINATED, [1 CPUs, 0 GPUs], [pid=19435], 0 s, 100 iter, 56.2 rew
- MyTrainableClass_13_height=14,width=49: TERMINATED, [1 CPUs, 0 GPUs], [pid=19430], 0 s, 15 iter, 4.16 rew
- MyTrainableClass_14_height=8,width=78: TERMINATED, [1 CPUs, 0 GPUs], [pid=19374], 0 s, 15 iter, 1.52 rew
- MyTrainableClass_15_height=80,width=60: TERMINATED, [1 CPUs, 0 GPUs], [pid=19360], 0 s, 15 iter, 19.6 rew
- MyTrainableClass_16_height=33,width=55: TERMINATED, [1 CPUs, 0 GP