https://github.com/lyuwenyu/RL
Reinforcement Learning
1.
MDP (Markov Decision Process):
(S, A, P, R, gamma), PI
S (state)
A (action)
P (state transition probability)
gamma (discount factor)
R (reward)
PI (policy)
G (Return)
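The return G is the discounted sum of the rewards that follow a time step. A minimal sketch, assuming one finished episode is given as a plain list of rewards; the function name and the parameter name `gamma` (the discount) are illustrative and not taken from the repository:

```python
def discounted_return(rewards, gamma):
    """Return G = r_1 + gamma * r_2 + gamma^2 * r_3 + ... for one episode."""
    g = 0.0
    # Accumulate from the last reward backwards: G_t = r_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With gamma = 1 every reward counts equally; with gamma close to 0 only the immediate reward matters.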
Bellman equation
State-value function v(s)
Action-value function q(s,a)
Optimal state-value function
Optimal action-value function
Optimal policy
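For reference, the quantities listed above can be written out in their standard textbook form. The notation is an assumption for illustration: \pi stands for the policy PI, \gamma for the discount, R_s^a for the expected immediate reward and P_{ss'}^a for the transition probability.

```latex
% Value functions of a policy \pi
v_\pi(s)   = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]
q_\pi(s,a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]

% Bellman expectation equations
v_\pi(s)   = \sum_a \pi(a \mid s) \left( R_s^a + \gamma \sum_{s'} P_{ss'}^a \, v_\pi(s') \right)
q_\pi(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a \sum_{a'} \pi(a' \mid s') \, q_\pi(s',a')

% Optimal value functions (Bellman optimality equations)
v_*(s)   = \max_a \left( R_s^a + \gamma \sum_{s'} P_{ss'}^a \, v_*(s') \right)
q_*(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a \max_{a'} q_*(s',a')

% Optimal policy: act greedily with respect to q_*
\pi_*(s) = \arg\max_a q_*(s,a)
```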
2.
Model-based solution
Dynamic Programming
Value Iteration
Policy Iteration:
Policy evaluation
Policy improvement (greedy)
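Both dynamic programming methods need the full model (transitions and rewards). A minimal sketch on a made-up 2-state, 2-action MDP; the table `P`, its rewards and all function names are hypothetical and only serve to illustrate the two loops:

```python
# Hypothetical MDP: P[s][a] = [(probability, next_state, reward), ...]
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9

def q_value(P, v, s, a, gamma):
    """One-step lookahead: expected reward plus discounted next-state value."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])

def greedy_policy(P, v, gamma):
    """Policy improvement: pick the action with the highest lookahead value."""
    return {s: max(P[s], key=lambda a: q_value(P, v, s, a, gamma)) for s in P}

def value_iteration(P, gamma, tol=1e-10):
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup
            new_v = max(q_value(P, v, s, a, gamma) for a in P[s])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v, greedy_policy(P, v, gamma)

def policy_evaluation(P, policy, gamma, tol=1e-10):
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman expectation backup for a deterministic policy
            new_v = q_value(P, v, s, policy[s], gamma)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

def policy_iteration(P, gamma):
    policy = {s: next(iter(P[s])) for s in P}  # start with the first action everywhere
    while True:
        v = policy_evaluation(P, policy, gamma)
        improved = greedy_policy(P, v, gamma)
        if improved == policy:  # stable policy: stop
            return v, policy
        policy = improved
```

On this toy model both `value_iteration(P, GAMMA)` and `policy_iteration(P, GAMMA)` should settle on taking action 1 in both states.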
3.
Model-free solution
Policy Evaluation
MC (Monte Carlo)
TD (Temporal Difference)
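Without a model, the value of a policy is estimated from sampled episodes. A minimal sketch of both estimators, assuming each episode is a list of (state, reward, next_state) transitions with next_state set to None at termination; this data layout and the function names are assumptions for illustration:

```python
from collections import defaultdict

def mc_evaluate(episodes, gamma):
    """Every-visit Monte Carlo: average the complete returns observed per state."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        g = 0.0
        # Walk backwards so g is the return that followed each visited state.
        for state, reward, _next_state in reversed(episode):
            g = reward + gamma * g
            returns_sum[state] += g
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

def td0_evaluate(episodes, gamma, alpha=0.1, sweeps=1):
    """TD(0): move v(s) towards reward + gamma * v(next_state) after every step."""
    v = defaultdict(float)
    for _ in range(sweeps):
        for episode in episodes:
            for state, reward, next_state in episode:
                # A terminal next state contributes no future value.
                bootstrap = 0.0 if next_state is None else gamma * v[next_state]
                v[state] += alpha * (reward + bootstrap - v[state])
    return dict(v)
```

MC waits for the end of an episode and uses the actual return; TD updates after each step using its own current estimate of the next state.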
4.
on-policy
off-policy
SARSA (on-policy)
Q-learning (off-policy)
off-policy: the policy being learned can differ from the policy being executed.
on-policy: value functions are updated strictly on the basis of experience gained from executing some (possibly non-stationary) policy.
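The difference shows up in a single line of the update rule. A minimal sketch, assuming the action values live in a dict keyed by (state, action); the function names and the dict layout are illustrative and not taken from the repository:

```python
from collections import defaultdict

def sarsa_update(q, s, a, r, s2, a2, alpha, gamma):
    """On-policy: bootstrap from a2, the action actually taken next."""
    target = r + gamma * q[(s2, a2)]
    q[(s, a)] += alpha * (target - q[(s, a)])

def q_learning_update(q, s, a, r, s2, actions, alpha, gamma):
    """Off-policy: bootstrap from the greedy action in s2, whatever is executed next."""
    target = r + gamma * max(q[(s2, b)] for b in actions)
    q[(s, a)] += alpha * (target - q[(s, a)])

# Same transition, but the agent explores and takes the worse action "left" next.
q_sarsa = defaultdict(float, {("s1", "left"): 1.0, ("s1", "right"): 5.0})
q_qlearn = defaultdict(float, {("s1", "left"): 1.0, ("s1", "right"): 5.0})
sarsa_update(q_sarsa, "s0", "right", 0.0, "s1", "left", alpha=0.5, gamma=1.0)
q_learning_update(q_qlearn, "s0", "right", 0.0, "s1", ["left", "right"], alpha=0.5, gamma=1.0)
```

After the exploratory step, SARSA's estimate reflects the action that was really executed, while Q-learning's estimate reflects the best action available in the next state.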
-----------------------reference-----------------------------
1. https://www.youtube.com/watch?v=0g4j2k_Ggc4
2. http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
3. http://www.algorithmdog.com/reinforcement-learning-value-function-approximation