[ICML 2018] RLlib: Abstractions for Distributed Reinforcement Learning

https://arxiv.org/abs/1712.09381

Introduction

Many of the challenges in reinforcement learning stem from the need to scale learning and simulation while also integrating a rapidly increasing range of algorithms and models.

Many of the frameworks used by these libraries rely on communication between long-running program replicas for distributed execution, such as MPI, Distributed TensorFlow, and parameter servers. However, these frameworks do not naturally encapsulate parallelism and resource requirements within individual components.

We believe that the ability to build scalable RL algorithms by composing and reusing existing components and implementations is essential for the rapid development and progress of the field. Toward this end, we argue for structuring distributed RL components around the principles of logically centralized program control and parallelism encapsulation.

Logically Centralized Control & Hierarchical Control

  • Distributed Control. Most RL algorithms today are written in a fully distributed style where replicated processes independently compute and coordinate with each other according to their roles (if any).
  • Logically Centralized Control. A single driver program can delegate algorithm sub-tasks to other processes to execute in parallel.
  • Hierarchical Control. To support nested computations, we propose extending the centralized control model with hierarchical delegation of control, which allows the worker processes to further delegate work to sub-workers of their own when executing tasks.


Hierarchical Parallel Task Model

  • Distributed Control. Parallelizing entire programs using frameworks like MPI and Distributed TensorFlow typically requires explicit algorithm modifications to insert points of coordination when trying to compose two programs or components together.
  • Hierarchical Control based on Ray. Ray meets this requirement with Ray actors, which are Python classes that may be created in the cluster and accept remote method calls. Ray permits these actors to in turn launch more actors and schedule tasks on those actors as part of a method call, satisfying our need for hierarchical delegation as well.
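As a rough illustration of this hierarchical delegation, here is a minimal sketch with made-up actor names (not RLlib code): a driver delegates to Worker actors, and each Worker in turn launches SubWorker actors and schedules tasks on them.

import ray

ray.init()

@ray.remote
class SubWorker:
    def simulate(self, step):
        # Stand-in for an expensive simulation or rollout task.
        return step * step

@ray.remote
class Worker:
    def __init__(self, num_sub_workers=2):
        # Hierarchical delegation: this actor creates more actors itself.
        self.sub_workers = [SubWorker.remote() for _ in range(num_sub_workers)]

    def run(self, steps):
        # Fan the work out across the sub-workers, then aggregate their results.
        futures = [
            sub.simulate.remote(s)
            for s, sub in zip(steps, self.sub_workers * len(steps))
        ]
        return sum(ray.get(futures))

# The driver keeps logically centralized control: it decides what runs where
# and when, while the actual execution happens in parallel in the cluster.
workers = [Worker.remote() for _ in range(3)]
print(ray.get([w.run.remote([1, 2, 3, 4]) for w in workers]))  # [30, 30, 30]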


Abstractions for Reinforcement Learning


Policy Graph

To interface with RLlib, an algorithm's policy model, trajectory postprocessor, and loss are defined in a policy graph class with the following methods:
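A rough Python sketch of that interface follows; the method names are based on the pre-1.0 ray.rllib PolicyGraph class from memory, so treat the signatures as approximate rather than as the paper's exact listing.

class PolicyGraph:
    def compute_actions(self, obs_batch, state_batches=None):
        """Choose actions for a batch of observations (and optional RNN state)."""
        raise NotImplementedError

    def postprocess_trajectory(self, sample_batch, other_agent_batches=None):
        """Post-process a trajectory, e.g. compute advantages or n-step returns."""
        return sample_batch

    def compute_gradients(self, samples):
        """Return gradients of the policy loss on a batch of experience."""
        raise NotImplementedError

    def apply_gradients(self, grads):
        """Apply previously computed gradients to the local model."""
        raise NotImplementedError

    def get_weights(self):
        """Return the current model weights, e.g. for broadcast to replicas."""
        raise NotImplementedError

    def set_weights(self, weights):
        """Overwrite the model weights, e.g. with weights sent by the driver."""
        raise NotImplementedError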

Policy Evaluation

RLlib provides a PolicyEvaluator class that wraps a policy graph and environment to add a method to sample() experience batches. Policy evaluator instances can be created as Ray remote actors and replicated across a cluster for parallelism.

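Below is a minimal sketch of this pattern using hypothetical stand-in classes rather than RLlib's actual PolicyEvaluator API: an actor wraps a policy and an environment and returns batches of experience from sample(), and several such actors are replicated for parallel sampling.

import random
import ray

ray.init()

class RandomPolicy:
    """Stand-in policy: picks a random action from a discrete space."""
    def __init__(self, num_actions):
        self.num_actions = num_actions

    def compute_action(self, obs):
        return random.randrange(self.num_actions)

class ToyEnv:
    """Stand-in environment with a fixed-length episode."""
    def __init__(self, horizon=10):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0  # dummy observation

    def step(self, action):
        self.t += 1
        return float(self.t), 1.0, self.t >= self.horizon  # obs, reward, done

@ray.remote
class Evaluator:
    def __init__(self):
        self.policy = RandomPolicy(num_actions=2)
        self.env = ToyEnv()

    def sample(self, batch_size=32):
        """Roll out the policy and return a batch of (obs, action, reward)."""
        batch = []
        obs, done = self.env.reset(), False
        while len(batch) < batch_size:
            action = self.policy.compute_action(obs)
            next_obs, reward, done = self.env.step(action)
            batch.append((obs, action, reward))
            obs = self.env.reset() if done else next_obs
        return batch

# Replicate the evaluator across the cluster and gather samples in parallel.
evaluators = [Evaluator.remote() for _ in range(4)]
batches = ray.get([ev.sample.remote() for ev in evaluators])
print(sum(len(b) for b in batches), "transitions collected")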

Policy Optimization

The policy optimizer is responsible for the performance-critical tasks of distributed sampling, parameter updates, and managing replay buffers. To distribute the computation, the optimizer operates over a set of policy evaluator replicas.


Pseudocode for four RLlib policy optimizer step methods. Each step() operates over a local policy graph and array of remote evaluator replicas.
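As one concrete illustration of this structure (names are placeholders, not RLlib's optimizer API), a synchronous-sampling step() conceptually gathers samples from every evaluator replica, updates the driver's local policy graph, and broadcasts the new weights:

import ray

def sync_samples_step(local_graph, remote_evaluators):
    # 1. Collect experience batches from all evaluator replicas in parallel.
    batches = ray.get([ev.sample.remote() for ev in remote_evaluators])
    samples = [item for batch in batches for item in batch]

    # 2. Compute and apply gradients on the driver's local policy graph.
    grads = local_graph.compute_gradients(samples)
    local_graph.apply_gradients(grads)

    # 3. Broadcast the updated weights back to every evaluator replica.
    weights = ray.put(local_graph.get_weights())
    for ev in remote_evaluators:
        ev.set_weights.remote(weights)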

For the variant shown in panel (c), see the standalone Ray parameter-server example at https://docs.ray.io/en/master/auto_examples/plot_parameter_server.html; its core training loop is reproduced below.

# Excerpt from the linked Ray example: ps is a ParameterServer actor,
# workers is a list of worker actors, and model, evaluate, and test_loader
# are defined earlier in that example.
current_weights = ps.get_weights.remote()

# Map each in-flight gradient ObjectRef back to the worker that produced it.
gradients = {}
for worker in workers:
    gradients[worker.compute_gradients.remote(current_weights)] = worker

for i in range(iterations * num_workers):
    # As soon as any one worker finishes, apply its gradient and immediately
    # hand that worker the updated weights so it never sits idle.
    ready_gradient_list, _ = ray.wait(list(gradients))
    ready_gradient_id = ready_gradient_list[0]
    worker = gradients.pop(ready_gradient_id)

    # Compute and apply gradients.
    current_weights = ps.apply_gradients.remote(*[ready_gradient_id])
    gradients[worker.compute_gradients.remote(current_weights)] = worker

    if i % 10 == 0:
        # Evaluate the current model after every 10 updates.
        model.set_weights(ray.get(current_weights))
        accuracy = evaluate(model, test_loader)
        print("Iter {}: \taccuracy is {:.1f}".format(i, accuracy))

print("Final accuracy is {:.1f}.".format(accuracy))

All policy optimizers exported by RLlib's optimizers package (its __all__ list):

__all__ = [
    "PolicyOptimizer",
    "AsyncReplayOptimizer",
    "AsyncSamplesOptimizer",
    "AsyncGradientsOptimizer",
    "SyncSamplesOptimizer",
    "SyncReplayOptimizer",
    "LocalMultiGPUOptimizer",
    "SyncBatchReplayOptimizer",
]

Framework Performance

Fault tolerance and straggler mitigation

Failure events become significant at scale. RLlib leverages Ray’s built-in fault tolerance mechanisms, reducing costs with preemptible cloud compute instances. Similarly, stragglers can significantly impact the performance of distributed algorithms at scale. RLlib supports straggler mitigation in a generic way via the ray.wait() primitive. For example, in PPO we use this to drop the slowest evaluator tasks, at the cost of some bias.
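A minimal sketch of that ray.wait()-based pattern (evaluator and method names are placeholders): wait for the fastest fraction of sample tasks and drop the rest, trading a small sampling bias for bounded iteration latency.

import ray

def sample_with_straggler_drop(evaluators, keep_fraction=0.8, timeout=None):
    pending = [ev.sample.remote() for ev in evaluators]
    num_keep = max(1, int(keep_fraction * len(pending)))
    # Block until num_keep tasks finish; whatever is still pending is a straggler.
    ready, _stragglers = ray.wait(pending, num_returns=num_keep, timeout=timeout)
    return ray.get(ready)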

Data compression

RLlib uses the LZ4 algorithm to compress experience batches. For image observations, LZ4 reduces network traffic and memory usage by more than an order of magnitude, at a compression rate of ~1 GB/s per core.
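A small sketch of the idea using the lz4 Python package (RLlib's own compression helpers differ in detail): compress an image-observation batch before shipping it over the network, then decompress and restore it on the receiving side.

import numpy as np
import lz4.frame

obs_batch = np.zeros((32, 84, 84, 4), dtype=np.uint8)  # Atari-style frame stack

raw = obs_batch.tobytes()
compressed = lz4.frame.compress(raw)
print("compression ratio: {:.0f}x".format(len(raw) / len(compressed)))

# Decompress on the receiving side and restore the original shape and dtype.
restored = np.frombuffer(lz4.frame.decompress(compressed), dtype=np.uint8)
assert np.array_equal(restored.reshape(obs_batch.shape), obs_batch)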

Evaluation

AWS m4.16xl CPU instances, see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/cpu-options-supported-instances-values.html


p2.16xl GPU instance, see https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-gpu.html


x1.16xl instances, see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/cpu-options-supported-instances-values.html

