https://arxiv.org/abs/1712.09381
Introduction
Many of the challenges in reinforcement learning stem from the need to scale learning and simulation while also integrating a rapidly growing range of algorithms and models.
Many of the frameworks used by these libraries, such as MPI, Distributed TensorFlow, and parameter servers, rely on communication between long-running program replicas for distributed execution. This style does not naturally encapsulate parallelism and resource requirements within individual components.
We believe that the ability to build scalable RL algorithms by composing and reusing existing components and implementations is essential for the rapid development and progress of the field. Toward this end, we argue for structuring distributed RL components around the principles of logically centralized program control and parallelism encapsulation.
Logically Centralized Control & Hierarchical Control
- Distributed Control. Most RL algorithms today are written in a fully distributed style where replicated processes independently compute and coordinate with each other according to their roles (if any).
- Logically Centralized Control. A single driver program delegates algorithm sub-tasks to other processes, which execute them in parallel.
- Hierarchical Control. To support nested computations, we propose extending the centralized control model with hierarchical delegation of control, which allows the worker processes to further delegate work to sub-workers of their own when executing tasks.
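The distinction can be sketched with standard-library thread pools standing in for remote processes (all names here are illustrative; a real system such as Ray would use actors and remote tasks): a single driver delegates sub-tasks to workers, and each worker may further delegate to sub-workers of its own.

```python
from concurrent.futures import ThreadPoolExecutor

def sub_task(x):
    # Leaf-level work delegated by a worker to a sub-worker.
    return x * x

def worker_task(chunk):
    # A worker may itself delegate to sub-workers (hierarchical control).
    with ThreadPoolExecutor(max_workers=2) as sub_pool:
        return sum(sub_pool.map(sub_task, chunk))

def driver(data, num_workers=2):
    # The single driver program delegates sub-tasks to workers in
    # parallel (logically centralized control).
    chunks = [data[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return sum(pool.map(worker_task, chunks))

print(driver([1, 2, 3, 4]))  # 1 + 4 + 9 + 16 = 30
```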
Hierarchical Parallel Task Model
- Distributed Control. Parallelizing entire programs with frameworks like MPI and Distributed TensorFlow typically requires explicit algorithm modifications to insert points of coordination when composing two programs or components.
- Hierarchical Control based on Ray. Ray meets this requirement with Ray actors, which are Python classes that may be created in the cluster and accept remote method calls. Ray permits these actors to in turn launch more actors and schedule tasks on those actors as part of a method call, satisfying our need for hierarchical delegation as well.
Abstractions for Reinforcement Learning
Policy Graph
To interface with RLlib, an algorithm's policy-specific logic is defined in a policy graph class with the following methods:
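A sketch of such a policy graph class, with method names paraphrased from the RLlib paper (signatures are approximate, not the shipped API):

```python
class PolicyGraph:
    """Sketch of the policy graph interface described in the paper."""

    def act(self, batch_obs, batch_prev_rewards, batch_prev_actions):
        # Map a batch of observations to a batch of actions.
        raise NotImplementedError

    def postprocess(self, batch, other_agent_batches=None):
        # Transform a sampled trajectory batch, e.g. advantage estimation.
        raise NotImplementedError

    def gradients(self, batch):
        # Compute loss gradients for a batch of experiences.
        raise NotImplementedError

    def apply_gradients(self, grads):
        # Apply gradients to the policy's model.
        raise NotImplementedError

    def get_weights(self):
        raise NotImplementedError

    def set_weights(self, weights):
        raise NotImplementedError
```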
Policy Evaluation
RLlib provides a PolicyEvaluator class that wraps a policy graph and environment to add a method to sample() experience batches. Policy evaluator instances can be created as Ray remote actors and replicated across a cluster for parallelism.
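A toy sketch of the evaluator idea (the environment and policy below are illustrative stand-ins; in RLlib, evaluator instances are created as Ray remote actors rather than plain objects):

```python
class PolicyEvaluator:
    # Wraps a policy and an environment, adding a sample() method that
    # returns a batch of (obs, action, reward) experience tuples.
    def __init__(self, env, policy):
        self.env = env
        self.policy = policy

    def sample(self, horizon=3):
        obs = self.env.reset()
        batch = []
        for _ in range(horizon):
            action = self.policy(obs)
            next_obs, reward, done = self.env.step(action)
            batch.append((obs, action, reward))
            obs = self.env.reset() if done else next_obs
        return batch

class CountEnv:
    # Toy environment: the observation counts steps; episodes end at 2.
    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 2

ev = PolicyEvaluator(CountEnv(), policy=lambda obs: 0)
print(len(ev.sample()))  # 3 experience tuples
```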
Policy Optimization
The policy optimizer is responsible for the performance-critical tasks of distributed sampling, parameter updates, and managing replay buffers. To distribute the computation, the optimizer operates over a set of policy evaluator replicas.
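As an illustration, here is a sequential stand-in for one synchronous-sampling optimizer step over evaluator replicas (all classes are toy stand-ins invented for the example; a real optimizer would issue the sample() calls as remote tasks on Ray actors):

```python
class AveragingGraph:
    # Toy "policy graph": the weight is a running mean of sampled values.
    def __init__(self):
        self.weights = 0.0

    def gradients(self, batch):
        return sum(batch) / len(batch) - self.weights

    def apply_gradients(self, grad):
        self.weights += grad

    def get_weights(self):
        return self.weights

class Evaluator:
    # Toy stand-in for a remote policy evaluator replica.
    def __init__(self, data):
        self.data = data
        self.weights = None

    def sample(self):
        return list(self.data)

    def set_weights(self, w):
        self.weights = w

def sync_sample_step(local_graph, evaluators):
    # One synchronous step: gather sample batches from every evaluator
    # replica, update the local graph, then broadcast the new weights.
    batch = [x for ev in evaluators for x in ev.sample()]
    local_graph.apply_gradients(local_graph.gradients(batch))
    weights = local_graph.get_weights()
    for ev in evaluators:
        ev.set_weights(weights)
    return weights

evs = [Evaluator([1, 2]), Evaluator([3, 4])]
print(sync_sample_step(AveragingGraph(), evs))  # mean of 1..4 = 2.5
```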
Pseudocode for four RLlib policy optimizer step methods. Each step() operates over a local policy graph and array of remote evaluator replicas.
For details on (c), see https://docs.ray.io/en/master/auto_examples/plot_parameter_server.html
# From the linked Ray parameter-server example; assumes `ps` (a
# parameter-server actor), `workers` (data-worker actors), `model`,
# `evaluate`, `test_loader`, `iterations`, and `num_workers` are
# defined as in that example.
current_weights = ps.get_weights.remote()

gradients = {}
for worker in workers:
    gradients[worker.compute_gradients.remote(current_weights)] = worker

for i in range(iterations * num_workers):
    ready_gradient_list, _ = ray.wait(list(gradients))
    ready_gradient_id = ready_gradient_list[0]
    worker = gradients.pop(ready_gradient_id)

    # Compute and apply gradients.
    current_weights = ps.apply_gradients.remote(*[ready_gradient_id])
    gradients[worker.compute_gradients.remote(current_weights)] = worker

    if i % 10 == 0:
        # Evaluate the current model after every 10 updates.
        model.set_weights(ray.get(current_weights))
        accuracy = evaluate(model, test_loader)
        print("Iter {}: \taccuracy is {:.1f}".format(i, accuracy))

print("Final accuracy is {:.1f}.".format(accuracy))
All optimizers in RLlib.
__all__ = [
"PolicyOptimizer",
"AsyncReplayOptimizer",
"AsyncSamplesOptimizer",
"AsyncGradientsOptimizer",
"SyncSamplesOptimizer",
"SyncReplayOptimizer",
"LocalMultiGPUOptimizer",
"SyncBatchReplayOptimizer",
]
Framework Performance
Fault tolerance and straggler mitigation
Failure events become significant at scale. RLlib leverages Ray’s built-in fault tolerance mechanisms, which also cuts cost by making it practical to run on preemptible cloud compute instances. Similarly, stragglers can significantly impact the performance of distributed algorithms at scale. RLlib supports straggler mitigation in a generic way via the ray.wait() primitive. For example, in PPO it is used to drop the slowest evaluator tasks, at the cost of some bias.
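The idea can be illustrated with standard-library futures (`as_completed` serves here as a stdlib analogue of ray.wait, and the delays are illustrative): keep the first k results and drop the straggler.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def sample(delay):
    # Stand-in for a remote evaluator's sample task; the delay models a
    # fast or slow (straggling) replica.
    time.sleep(delay)
    return delay

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(sample, d) for d in (0.01, 0.02, 0.03, 0.5)]
    fastest = []
    for fut in as_completed(futures):
        fastest.append(fut.result())
        if len(fastest) == 3:  # like ray.wait(futures, num_returns=3)
            break  # proceed without the straggler's result

print(sorted(fastest))  # the three fastest tasks; the 0.5 s task is dropped
```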
Data compression
RLlib uses the LZ4 algorithm to compress experience batches. For image observations, LZ4 reduces network traffic and memory usage by more than an order of magnitude, at a compression rate of ~1 GB/s per core.
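A portable sketch of the effect using the stdlib zlib codec (RLlib's actual choice is LZ4, picked for throughput; the frame below is a synthetic stand-in for an image observation):

```python
import zlib

# Synthetic 84x84 grayscale frame: a uniform background, the kind of
# redundancy that makes image observations highly compressible.
frame = bytes(84 * 84)
compressed = zlib.compress(frame)
print(len(frame) / len(compressed) > 10)  # order-of-magnitude reduction
```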
Evaluation
AWS m4.16xl CPU instances, see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/cpu-options-supported-instances-values.html
p2.16xl GPU instance, see https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-gpu.html
x1.16xl, see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/cpu-options-supported-instances-values.html