Spinning Up USER DOCUMENTATION
Environments
Spinning Up requires Python3, OpenAI Gym, and OpenMPI. MuJoCo is optional but preferred.
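As a quick sanity check, a minimal sketch like the following can verify that the core dependencies are importable (the environment names are illustrative, and the classic pre-0.26 Gym API is assumed):

    # Minimal sketch: check that the core dependencies are importable.
    # Assumes the classic Gym API; environment names are illustrative.
    import gym               # OpenAI Gym, required
    from mpi4py import MPI   # Python bindings backed by OpenMPI

    env = gym.make("CartPole-v1")   # runs without MuJoCo
    print("Gym OK, observation shape:", env.observation_space.shape)

    try:
        gym.make("HalfCheetah-v2")  # a MuJoCo environment; optional but preferred
        print("MuJoCo environments available.")
    except Exception:
        print("MuJoCo not installed; non-MuJoCo environments still work.")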
Algorithms
Spinning Up implements six algorithms: Vanilla Policy Gradient (VPG), Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), and Soft Actor-Critic (SAC).
The On-Policy Algorithms
Vanilla Policy Gradient (VPG) is the most basic, entry-level algorithm in the deep RL space. It led to stronger algorithms such as TRPO and PPO.
On-Policy:
- They don't use old data.
- They are weaker on sample efficiency.
- These algorithms directly optimize policy performance, trading off sample efficiency in favor of stability.
- TRPO and PPO are meant to make up the deficit in sample efficiency. (A sketch of the basic on-policy update follows this list.)
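For intuition, here is a minimal sketch of the basic on-policy update (a simplified REINFORCE-style illustration, not Spinning Up's implementation; the network sizes, learning rate, and classic Gym API are assumptions):

    # Minimal sketch of a vanilla policy gradient update on one episode.
    # Illustration only; not Spinning Up's implementation.
    import gym
    import torch
    import torch.nn as nn

    env = gym.make("CartPole-v1")
    obs_dim = env.observation_space.shape[0]
    n_acts = env.action_space.n

    # Small MLP policy; sizes are arbitrary.
    policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_acts))
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

    # Collect one on-policy episode.
    obs, log_probs, rewards, done = env.reset(), [], [], False
    while not done:
        dist = torch.distributions.Categorical(
            logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
        act = dist.sample()
        log_probs.append(dist.log_prob(act))
        obs, rew, done, _ = env.step(act.item())
        rewards.append(rew)

    # Policy gradient step: weight log-probs by the episode return.
    # The data is then discarded; on-policy methods don't reuse old data.
    ret = sum(rewards)
    loss = -(torch.stack(log_probs) * ret).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()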
The Off-Policy Algorithms
Deep Deterministic Policy Gradient (DDPG) is a foundational algorithm, just like VPG. DDPG is closely connected to Q-learning algorithms: it concurrently learns a Q-function and a policy, which are updated to improve each other.
Off-Policy:
- They can reuse old data very efficiently.
- They exploit the Bellman equations for optimality, which the Q-function can be trained to satisfy using any environment interaction data (as long as there is enough experience from the high-reward areas of the environment).
- They are less stable: there are no guarantees of strong policy performance.
- TD3 and SAC are meant to make up for this instability. (A sketch of the off-policy Q-update follows this list.)
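To make the Bellman-backup idea concrete, here is a minimal sketch of a DDPG-style Q-function update against a replay buffer (all names here, such as q_net, q_targ, pi_targ, and batch, are hypothetical placeholders, not Spinning Up's API):

    # Minimal sketch of an off-policy Q-function update from replayed data.
    # Names (q_net, q_targ, pi_targ, batch) are hypothetical placeholders.
    import torch

    gamma = 0.99  # discount factor

    def q_update(q_net, q_targ, pi_targ, optimizer, batch):
        # Batch is sampled from a replay buffer, so it may be old data.
        obs, act, rew, obs2, done = batch

        # Bellman backup: y = r + gamma * (1 - d) * Q_targ(s', mu_targ(s'))
        with torch.no_grad():
            y = rew + gamma * (1 - done) * q_targ(obs2, pi_targ(obs2))

        # Regress Q(s, a) toward the backup (mean-squared Bellman error).
        loss = ((q_net(obs, act) - y) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()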
Code Format
All implementations in Spinning Up adhere to a standard template. They are split into two files: an algorithm file, which contains the core logic of the algorithm, and a core file, which contains various utilities needed to run the algorithm.
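For example, PPO's implementation follows a layout along these lines (a sketch based on the Spinning Up repository; the comments are descriptive):

    spinup/algos/ppo/
        ppo.py    # algorithm file: the core logic (rollout collection, loss, update loop)
        core.py   # core file: utilities (network builders, helpers) used by ppo.py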