Brief Introduction to Reinforcement Learning I (Background Info)
Markov Chain
Markov Decision Process
$M=(S,A,P,R)$
- States: $s_i\in S$
- Actions: $a_i\in A$
- Transition probability distribution: $p(s'|s,a)\in P_{sa}$
- Reward: $r(s'|s,a)$
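Before moving on, here is a minimal sketch of how such a tabular MDP can be represented in Python. The array layout (`P[s, a, s']`, `R[s, a, s']`) and all the numbers are illustrative assumptions, not from the text:

```python
import numpy as np

# Hypothetical tabular MDP with 2 states and 2 actions.
# P[s, a, s'] = p(s'|s,a); each row P[s, a] sums to 1.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # transitions from state 0 under actions 0 and 1
    [[0.5, 0.5], [0.0, 1.0]],  # transitions from state 1 under actions 0 and 1
])

# R[s, a, s'] = r(s'|s,a), the reward for arriving in s' after taking a in s.
R = np.array([
    [[0.0, 1.0], [0.0, 2.0]],
    [[1.0, 0.0], [0.0, 0.5]],
])

gamma = 0.9  # discount factor used by the value functions below
```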
Value function: Bellman Equation
RL learns a policy $\pi:S\rightarrow A$. The reward function $R$ reflects only the immediate reward; for the long-term reward, we introduce the value function $V^{\pi}(s)$.
- State value function
$$V^{\pi}(s)=\sum_{s'\in S}p(s'|s,\pi(s))\left[r(s'|s,\pi(s))+\gamma V^\pi(s')\right]$$
- Action value function
$$Q(s,a)=\sum_{s'\in S}p(s'|s,a)\left[r(s'|s,a)+\gamma V^\pi(s')\right]$$
- Connection
We now have two different value functions, one over states and one over actions. $V$ can be seen as a specialization of $Q$ in which the action in each state is prescribed by the policy, so $V$ directly gives the return of following a sequence of states and actions:
$$V^{\pi}(s)=Q(s,\pi(s))$$
- Difference
$Q$ is defined on state-action pairs, while $V$ is defined on states.
- MDP optimal policy
$$\pi^*=\arg\max_\pi V^\pi(s),\quad \forall s\in S$$
Basic Solutions
Dynamic Programming
Policy Iteration
- Policy Evaluation
For a given policy $\pi$, the Policy Evaluation algorithm computes the state values $v(s)$.

ALGORITHM: Policy_Evaluation
Input: $\pi(a|s)$, the (possibly stochastic) policy to be evaluated
Initialize $v(s)=0$ for all $s\in S$
Repeat
    $\Delta\leftarrow0$
    For each $s\in S$:
        $tmp\leftarrow\sum_{a}\pi(a|s)\sum_{s'\in S}p(s'|s,a)\left[r(s'|s,a)+\gamma v(s')\right]$
        $\Delta\leftarrow\max(\Delta,|tmp-v(s)|)$
        $v(s)\leftarrow tmp$
Until $\Delta<\theta$ (a small positive threshold)
Output: $v\approx v^\pi$, the approximate state values
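Below is a minimal Python sketch of this loop, assuming the tabular arrays `P[s, a, s']` and `R[s, a, s']` from the earlier example and a stochastic policy stored as `pi[s, a]` (my own conventions, not from the text):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation on a tabular MDP.

    P[s, a, s'] = p(s'|s,a), R[s, a, s'] = r(s'|s,a), pi[s, a] = pi(a|s).
    Returns v with v[s] approximating V^pi(s).
    """
    v = np.zeros(P.shape[0])
    while True:
        delta = 0.0
        for s in range(P.shape[0]):
            # tmp = sum_a pi(a|s) sum_{s'} p(s'|s,a) [r(s'|s,a) + gamma v(s')]
            tmp = np.sum(pi[s][:, None] * P[s] * (R[s] + gamma * v[None, :]))
            delta = max(delta, abs(tmp - v[s]))
            v[s] = tmp
        if delta < theta:
            return v
```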
- Policy Improvement
For a given policy $\pi$ and state values $v(s)$, the Policy Improvement algorithm produces a better policy while leaving $v(s)$ untouched.

ALGORITHM: Policy_Improvement
Input: $\pi(s)$, $v(s)$
policy_stable $\leftarrow$ true
For each $s\in S$:
    $tmp\leftarrow\arg\max_a\sum_{s'\in S}p(s'|s,a)\left[r(s'|s,a)+\gamma v(s')\right]$
    If $tmp\not=\pi(s)$ Then policy_stable $\leftarrow$ false
    $\pi(s)\leftarrow tmp$
Output: $\pi$, the improved policy, and policy_stable
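A matching sketch of this greedy sweep, under the same assumed array layout; here the policy is a deterministic array of action indices:

```python
import numpy as np

def policy_improvement(P, R, pi, v, gamma=0.9):
    """Make pi greedy with respect to v; pi[s] is an action index.

    Returns the updated policy and whether it was already stable.
    """
    policy_stable = True
    for s in range(P.shape[0]):
        # q[a] = sum_{s'} p(s'|s,a) [r(s'|s,a) + gamma v(s')]
        q = np.sum(P[s] * (R[s] + gamma * v[None, :]), axis=1)
        best = int(np.argmax(q))
        if best != pi[s]:
            policy_stable = False
        pi[s] = best
    return pi, policy_stable
```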
- Policy Iteration
Combining Policy Evaluation and Policy Improvement, we obtain the Policy Iteration algorithm. The process is as follows:
$$\pi_0\xrightarrow{E} v_0\xrightarrow{I} \pi_1\xrightarrow{E} v_1\xrightarrow{I} \pi_2\rightarrow\cdots\xrightarrow{E} v^*\xrightarrow{I} \pi^*$$

ALGORITHM: Policy_Iteration
Initialize $v(s)\in\mathbb{R}$ and $\pi(s)\in A(s)$ arbitrarily for all $s\in S$
Repeat
    $v\leftarrow$ Policy_Evaluation$(\pi)$
    $\pi'$, policy_stable $\leftarrow$ Policy_Improvement$(\pi,v)$
    $\pi\leftarrow\pi'$
Until policy_stable = true
Output: $\pi\approx\pi^*$, $v\approx v^*$
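Composing the two sketches above gives a compact, illustrative policy iteration loop; the one-hot conversion is only needed because the evaluation sketch expects a stochastic `pi[s, a]`:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate evaluation and improvement until the policy stops changing."""
    n_states, n_actions = P.shape[0], P.shape[1]
    pi = np.zeros(n_states, dtype=int)  # arbitrary initial deterministic policy
    while True:
        pi_stochastic = np.eye(n_actions)[pi]  # one-hot pi(a|s)
        v = policy_evaluation(P, R, pi_stochastic, gamma)
        pi, policy_stable = policy_improvement(P, R, pi, v, gamma)
        if policy_stable:
            return pi, v
```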
Value Iteration
Compared with the Policy Iteration algorithm, Value Iteration keeps the policy implicit and folds the improvement step into the value update, so each iteration needs only a single sweep over all states $s$.
ALGORITHM: Value_Iteration
Initialize $v(s)\in\mathbb{R}$ arbitrarily for all $s\in S$
Repeat
    $\Delta\leftarrow0$
    For each $s\in S$:
        $tmp\leftarrow\max_a\sum_{s'\in S}p(s'|s,a)\left[r(s'|s,a)+\gamma v(s')\right]$
        $\Delta\leftarrow\max(\Delta,|tmp-v(s)|)$
        $v(s)\leftarrow tmp$
Until $\Delta<\theta$ (a small positive threshold)
For each $s\in S$:
    $\pi(s)\leftarrow\arg\max_a\sum_{s'\in S}p(s'|s,a)\left[r(s'|s,a)+\gamma v(s')\right]$
Output: $\pi$
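A sketch of the same backup with the `max` folded in, again under the assumed array layout from earlier:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Sweep max-over-actions backups until convergence, then extract pi."""
    v = np.zeros(P.shape[0])
    while True:
        delta = 0.0
        for s in range(P.shape[0]):
            q = np.sum(P[s] * (R[s] + gamma * v[None, :]), axis=1)
            delta = max(delta, abs(q.max() - v[s]))
            v[s] = q.max()
        if delta < theta:
            break
    # One final sweep to read off the greedy policy from the values.
    pi = np.array([
        int(np.argmax(np.sum(P[s] * (R[s] + gamma * v[None, :]), axis=1)))
        for s in range(P.shape[0])
    ])
    return pi, v
```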
Pros and Cons
- pros
  - interpretable
  - based on mathematical derivation
- cons
  - requires complete information about the environment
Monte Carlo
The MC method is a sample-based, stochastic counterpart of the DP method. It is defined only for episodic tasks (tasks that terminate within finitely many steps). There are first-visit MC methods (which count the number of episodes in which $s$ appears) and every-visit MC methods (which count every occurrence of $s$). In this section, we discuss first-visit MC methods only.
Similar to the DP method, the MC method has its own versions of the Policy Evaluation, Policy Improvement, and Policy Iteration processes as well.
Monte Carlo Policy Evaluation
- Input: the policy to be evaluated
- Step 1: generate some state sequences (each sequence is an episode)
- Step 2: for each state $s$, calculate the average return over all episodes in which $s$ appears
- Step 3: set these averages as the state values (see the sketch below)
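Here is a minimal first-visit MC evaluation sketch. It assumes a `generate_episode()` helper that rolls out the policy and returns a list of `(state, reward)` pairs, where the reward is the one received after leaving that state; both the helper and the episode format are my own conventions:

```python
from collections import defaultdict

def mc_policy_evaluation(generate_episode, num_episodes=1000, gamma=0.9):
    """First-visit MC: average the return after the first visit to each state."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for _ in range(num_episodes):
        episode = generate_episode()  # [(s_0, r_1), (s_1, r_2), ...]
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        g = 0.0
        # Walk backwards so g accumulates the discounted return from each t.
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            g = r + gamma * g
            if first_visit[s] == t:  # only count the first visit to s
                returns_sum[s] += g
                returns_cnt[s] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```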
Monte Carlo Estimation of Action Values
To improve the policy, we first need action values (Q-values). We can follow steps similar to Monte Carlo Policy Evaluation: generate sequences, calculate the average returns, and set them as Q-values. After that, we can improve the policy as follows: $\pi'(s)=\arg\max_a Q^\pi(s,a)$
Maintaining Exploration
There is a problem with the MC method. If we already have pre-set Q-values $Q(s,a_1)$ and $Q(s,a_2)$ with $Q(s,a_1)>Q(s,a_2)$, then $Q(s,a_2)$ will never be updated, because the greedy MC method will never choose $a_2$. This is similar to a multi-armed bandit problem. Maintaining Exploration replaces deterministic policies with soft policies, for example the $\epsilon$-greedy policy: execute the best action with probability $1-\epsilon$, otherwise execute one of the other actions. Decrease $\epsilon$ over time and the algorithm will converge.
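The $\epsilon$-greedy rule itself is tiny; a sketch:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """Pick the greedy action with probability 1 - epsilon, else a random one."""
    if np.random.rand() < epsilon:
        return int(np.random.randint(len(q_values)))
    return int(np.argmax(q_values))
```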
Monte Carlo Control
The process of Monte Carlo Control is as follows:
$$\pi_0\xrightarrow{E} q_0\xrightarrow{I} \pi_1\xrightarrow{E} q_1\xrightarrow{I} \pi_2\rightarrow\cdots\xrightarrow{E} q^*\xrightarrow{I} \pi^*$$
We can also keep the policy implicit and update the action values directly, which gives a value-iteration version of Monte Carlo Control; at the end of this algorithm, we generate the policy from the Q-values. A sketch follows.
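A compact sketch of this control loop, reusing the `epsilon_greedy` helper above. The `env_reset()`/`env_step(s, a)` helpers (returning a start state and `(next_state, reward, done)`) are assumed for illustration, and $\epsilon$ is decayed as $1/k$ per the previous section:

```python
from collections import defaultdict
import numpy as np

def mc_control(env_reset, env_step, n_actions, num_episodes=5000, gamma=0.9):
    """First-visit MC control with an epsilon-greedy behavior policy."""
    Q = defaultdict(lambda: np.zeros(n_actions))
    counts = defaultdict(lambda: np.zeros(n_actions))
    for k in range(1, num_episodes + 1):
        eps = 1.0 / k  # decrease epsilon over time
        s, episode, done = env_reset(), [], False
        while not done:
            a = epsilon_greedy(Q[s], eps)
            s_next, r, done = env_step(s, a)
            episode.append((s, a, r))
            s = s_next
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        g = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            g = r + gamma * g
            if first_visit[(s, a)] == t:  # first-visit update of Q(s, a)
                counts[s][a] += 1
                Q[s][a] += (g - Q[s][a]) / counts[s][a]
    return Q  # derive the final policy via argmax over Q[s]
```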
Pros and Cons
- pros
  - based on experience samples rather than a complete model of the environment
- cons
  - works on episodic tasks only
Temporal-Difference
TD Prediction
Consider the Bellman equation for the state value:
$$V_\pi(s_t)=E_\pi\left[R(s_{t+1})+\gamma V_\pi(s_{t+1})\mid s_{t+1}=\pi(s_t)\right]$$
When the policy $\pi$ is fixed, we have
$$V(s_t)=R(s_{t+1})+\gamma V(s_{t+1})$$
Then we define the TD error:
$$td\_error=|R(s_{t+1})+\gamma V_\pi(s_{t+1})-V_\pi(s_t)|$$
To optimize the model, all we need is to modify the policy $\pi$ to minimize the TD error:
$$\pi^*=\arg\min_\pi|R(s_{t+1})+\gamma V_\pi(s_{t+1})-V_\pi(s_t)|$$
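In practice the TD error is shrunk incrementally rather than minimized in one shot. Below is a sketch of TD(0) prediction; the learning rate `alpha` is not introduced in the text, and `env_reset`/`env_step`/`policy` are assumed helpers:

```python
from collections import defaultdict

def td0_prediction(env_reset, env_step, policy, alpha=0.1, gamma=0.9,
                   num_episodes=1000):
    """Nudge V(s_t) toward the bootstrapped target R + gamma * V(s_{t+1})."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        s, done = env_reset(), False
        while not done:
            s_next, r, done = env_step(s, policy(s))
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])  # one step down the TD error
            s = s_next
    return dict(V)
```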
N-step TD
Consider the definition of the state value:
$$V(s_t)=R(s_{t+1})+\gamma R(s_{t+2})+\cdots+\gamma^{n-1} R(s_{t+n})+\gamma^{n} V(s_{t+n})$$
As in the TD algorithm, we can rewrite the state value this way and obtain a new TD error for any step count $n$. If we set $n\rightarrow\infty$, it degenerates into the MC algorithm, so to achieve good performance, $n$ needs to be tuned. To reduce the effect of the step size on the results, we can multiply $V(s)$ by $1-\gamma$; the expected value then stays in the same order of magnitude for different values of the hyper-parameter $\gamma$. A sketch of the $n$-step return follows.
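A small sketch of the $n$-step return defined above; `rewards` holds $R_{t+1},\dots,R_{t+n}$ and `v_boot` is the bootstrapped $V(s_{t+n})$ (argument names are my own):

```python
def n_step_return(rewards, v_boot, gamma=0.9):
    """R_{t+1} + gamma R_{t+2} + ... + gamma^{n-1} R_{t+n} + gamma^n V(s_{t+n})."""
    g = 0.0
    for r in reversed(rewards):  # accumulate the discounted reward sum
        g = r + gamma * g
    return g + gamma ** len(rewards) * v_boot

# Example: three rewards ahead, bootstrapping from V(s_{t+3}) = 0.5.
print(n_step_return([1.0, 0.0, 2.0], v_boot=0.5))  # 2.9845
```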
Pros and Cons
- pros
  - more flexible than MC
  - available in both on-policy (SARSA) and off-policy (Q-Learning) settings
  - TD methods usually perform much better than the alternatives, so most state-of-the-art algorithms are based on TD