Brief Introduction to Reinforcement Learning I (Background Info)

Markov Chain

Markov Decision Process

$M=(S,A,P,R)$

States: $s_i\in S$
Actions: $a_i\in A$
Probability distribution of transitions: $p(s'|s,a)\in P_{sa}$
Reward: $r(s'|s,a)$

Value function: Bellman Equation

RL learns a policy $\pi: S\rightarrow A$. The reward function $R$ reflects only the immediate reward; to account for long-term reward, we introduce the value function $V^{\pi}(s)$.

  • State Value Function
    $V^{\pi}(s)=\sum_{s'\in S}p(s'|s,\pi(s))\left[r(s'|s,\pi(s))+\gamma V^\pi(s')\right]$
  • Action Value Function
    $Q(s,a)=\sum_{s'\in S}p(s'|s,a)\left[r(s'|s,a)+\gamma V^\pi(s')\right]$
  • Connection
    We now have two value functions, one over states and one over actions. $V$ can be viewed as a specialization of $Q$ in which the action at every state is prescribed by the policy, so $V$ directly gives the expected return of following that policy through a sequence of states and actions (see the numerical sketch after this list).
    $V^{\pi}(s)=Q(s,\pi(s))$
  • Difference
    $Q$ is defined on state-action pairs, while $V$ is defined on states.
  • MDP Optimal Policy
    $\pi^*=\arg\max_\pi V^\pi(s),\quad \forall s\in S$
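
The connection above can be checked numerically. Below is a minimal NumPy sketch on a made-up 2-state, 2-action MDP; the transition probabilities, rewards, discount factor, and policy are all invented for illustration.

```python
import numpy as np

# Toy MDP: P[s, a, s'] = p(s'|s,a), R[s, a, s'] = r(s'|s,a); numbers are made up.
n_states, gamma = 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, 0.0]]])
pi = np.array([0, 1])  # a fixed deterministic policy: pi(s0)=a0, pi(s1)=a1

# Solve V^pi exactly from the linear Bellman equation V = r_pi + gamma * P_pi V
P_pi = P[np.arange(n_states), pi]                   # p(s'|s, pi(s))
r_pi = (P_pi * R[np.arange(n_states), pi]).sum(1)   # expected immediate reward
V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Q(s,a) = sum_s' p(s'|s,a) [ r(s'|s,a) + gamma V^pi(s') ]
Q = (P * (R + gamma * V)).sum(axis=2)

# Check the connection V^pi(s) = Q(s, pi(s)): the two printed vectors should match.
print(V, Q[np.arange(n_states), pi])
```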

Basic Solutions

Dynamic Programming
Policy Iteration
  • Policy Evaluation
    For a given policy $\pi$, the Policy Evaluation algorithm computes the state values $v(s)$.

ALGORITHM: Policy_Evaluation
Input: $\pi(a|s)$, the (possibly stochastic) policy to be evaluated.
Initialize $v(s)=0$ for all $s\in S$
Repeat
    $\Delta\leftarrow 0$
    For each $s\in S$
        $tmp\leftarrow\sum_{a}\pi(a|s)\sum_{s'\in S}p(s'|s,a)\left[r(s'|s,a)+\gamma v(s')\right]$
        $\Delta\leftarrow\max(\Delta,\,|tmp-v(s)|)$
        $v(s)\leftarrow tmp$
Until $\Delta<\theta$ (a small positive threshold)
Output: $v\approx v^{\pi}$, the approximate values of states under $\pi$.
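
A minimal Python sketch of this procedure, assuming the MDP is given as arrays `P[s, a, s']` and `R[s, a, s']` and the policy as a matrix `pi[s, a]` (these names are ours, not from the original):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation for a stochastic policy pi[s, a].

    P[s, a, s'] = p(s'|s,a), R[s, a, s'] = r(s'|s,a).
    Returns v with v[s] approximating V^pi(s).
    """
    n_states = P.shape[0]
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # tmp = sum_a pi(a|s) sum_s' p(s'|s,a) [ r(s'|s,a) + gamma v(s') ]
            tmp = np.sum(pi[s][:, None] * P[s] * (R[s] + gamma * v))
            delta = max(delta, abs(tmp - v[s]))
            v[s] = tmp
        if delta < theta:
            return v
```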

  • Policy Improvement
    Given a policy $\pi$ and state values $v(s)$, the Policy Improvement algorithm produces a policy that is at least as good, leaving $v(s)$ untouched.

ALGORITHM: Policy_Improvement
Input: $\pi(s)$, $v(s)$.
Repeat
    policy_stable $\leftarrow$ true
    For each $s\in S$
        $tmp\leftarrow\arg\max_a\sum_{s'\in S}p(s'|s,a)\left[r(s'|s,a)+\gamma v(s')\right]$
        If $tmp\neq\pi(s)$ Then policy_stable $\leftarrow$ false
        $\pi(s)\leftarrow tmp$
Until policy_stable $=$ true
Output: $\pi$, the improved policy.
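
A matching Python sketch of the greedy improvement sweep, using the same assumed `P`/`R` arrays and a deterministic policy stored as an array of action indices:

```python
import numpy as np

def policy_improvement(P, R, v, pi, gamma=0.9):
    """Greedy policy improvement with respect to the state values v.

    Returns the improved (deterministic) policy and whether it is unchanged.
    """
    n_states = P.shape[0]
    policy_stable = True
    new_pi = pi.copy()
    for s in range(n_states):
        # Q(s, a) = sum_s' p(s'|s,a) [ r(s'|s,a) + gamma v(s') ] for every action a
        q_s = np.sum(P[s] * (R[s] + gamma * v), axis=1)
        best_a = int(np.argmax(q_s))
        if best_a != pi[s]:
            policy_stable = False
        new_pi[s] = best_a
    return new_pi, policy_stable
```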

  • Policy Iteration
    Combining Policy Evaluation and Policy Improvement gives the Policy Iteration algorithm. The process is as follows:
    $\pi_0\xrightarrow{E} v_0\xrightarrow{I} \pi_1\xrightarrow{E} v_1\xrightarrow{I} \pi_2\rightarrow\cdots\xrightarrow{E} v^*\xrightarrow{I} \pi^*$

ALGORITHM: Policy_Iteration
Initialize $v(s)\in\mathbb{R}$ and $\pi(s)\in A(s)$ randomly for all $s\in S$
Repeat
    $v\leftarrow$ Policy_Evaluation$(\pi)$
    $\pi'\leftarrow$ Policy_Improvement$(\pi,v)$
    policy_stable $\leftarrow$ true
    If $\pi\neq\pi'$ Then policy_stable $\leftarrow$ false
    $\pi\leftarrow\pi'$
Until policy_stable $=$ true
Output: $\pi, v$
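
Combining the two sketches above into the loop described by the algorithm (this reuses the hypothetical `policy_evaluation` and `policy_improvement` helpers defined earlier, so it is not self-contained on its own):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    """Alternate evaluation and improvement until the policy stops changing."""
    n_states, n_actions = P.shape[0], P.shape[1]
    pi = np.zeros(n_states, dtype=int)  # arbitrary initial deterministic policy
    while True:
        # Evaluate the current deterministic policy (encoded as a one-hot matrix).
        pi_matrix = np.eye(n_actions)[pi]
        v = policy_evaluation(P, R, pi_matrix, gamma, theta)
        pi, policy_stable = policy_improvement(P, R, v, pi, gamma)
        if policy_stable:
            return pi, v
```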

Value Iteration

Compared with the Policy Iteration algorithm, Value Iteration keeps the policy implicit and folds the improvement step into the evaluation update (the max over actions), so each iteration needs only a single sweep over all states $s$.

ALGORITHM: Value_Iteration
Initialize $v(s)\in\mathbb{R}$ randomly for all $s\in S$
Repeat
    $\Delta\leftarrow 0$
    For each $s\in S$
        $tmp\leftarrow\max_a\sum_{s'\in S}p(s'|s,a)\left[r(s'|s,a)+\gamma v(s')\right]$
        $\Delta\leftarrow\max(\Delta,\,|tmp-v(s)|)$
        $v(s)\leftarrow tmp$
Until $\Delta<\theta$ (a small positive threshold)
For each $s\in S$
    $\pi(s)\leftarrow\arg\max_a\sum_{s'\in S}p(s'|s,a)\left[r(s'|s,a)+\gamma v(s')\right]$
Output: $\pi$
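
A corresponding Python sketch under the same assumed `P[s, a, s']` / `R[s, a, s']` representation:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Value iteration: one max-over-actions sweep per iteration."""
    n_states = P.shape[0]
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            q_s = np.sum(P[s] * (R[s] + gamma * v), axis=1)  # Q(s, .) under current v
            tmp = q_s.max()
            delta = max(delta, abs(tmp - v[s]))
            v[s] = tmp
        if delta < theta:
            break
    # Extract the greedy policy from the converged values.
    pi = np.array([int(np.argmax(np.sum(P[s] * (R[s] + gamma * v), axis=1)))
                   for s in range(n_states)])
    return pi, v
```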

Pros and Cons
  • pros
    • interpretable
    • grounded in mathematical derivation
  • cons
    • requires complete knowledge of the environment (the full transition and reward model)
Monte Carlo

The MC method is a sampling-based counterpart of the DP method. It is defined only on episodic tasks (tasks that terminate in a finite number of steps). There are first-visit MC methods (which average over the first visit to $s$ in each episode) and every-visit MC methods (which average over every visit to $s$). In this section, we discuss first-visit MC methods only.

Similar to the DP method, the MC method has its own versions of the Policy Evaluation, Policy Improvement, and Policy Iteration processes.

Monte Carlo Policy Evaluation
  • Input: the policy to be evaluated
  • Step 1: generate a number of state sequences (each sequence is an episode)
  • Step 2: for each state $s$, compute the average return over all episodes in which $s$ appears
  • Step 3: set these average returns as the state values (a minimal sketch follows this list)
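
A minimal sketch of first-visit MC evaluation, assuming a hypothetical `generate_episode(pi)` helper that returns one episode as a list of `(state, reward)` pairs, where the reward is the one received after leaving that state:

```python
from collections import defaultdict

def first_visit_mc_evaluation(generate_episode, pi, n_episodes=10_000, gamma=0.9):
    """First-visit Monte Carlo estimation of V^pi from sampled episodes."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for _ in range(n_episodes):
        episode = generate_episode(pi)           # list of (state, reward) pairs
        # Discounted return following each time step, computed backwards.
        g, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            g = episode[t][1] + gamma * g
            returns[t] = g
        # First-visit: only the first occurrence of each state in the episode counts.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns_sum[s] += returns[t]
                returns_cnt[s] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```
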
Monte Carlo Estimation of Action Values

To improve the policy, we first need action values (Q-values). We can follow the same steps as Monte Carlo Policy Evaluation: generate sequences, compute the average returns, and set them as Q-values. After that, we improve the policy as follows: $\pi'(s)=\arg\max_a Q^\pi(s,a)$

Maintaining Exploration

There is a problem with the MC method. If we already have initial Q-values $Q(s,a_1)$ and $Q(s,a_2)$ with $Q(s,a_1)>Q(s,a_2)$, then $Q(s,a_2)$ will never be updated, because a greedy MC method will never choose that action. This is similar to a multi-armed bandit problem. Maintaining exploration replaces deterministic policies with soft policies, for example the $\epsilon$-greedy policy: execute the best action with probability $1-\epsilon$, and otherwise execute one of the other actions. Decreasing $\epsilon$ over time lets the algorithm converge.
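
A small sketch of $\epsilon$-greedy action selection (the function name and interface are ours, for illustration):

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng=None):
    """Pick the greedy action with probability 1 - epsilon, otherwise explore.

    q_values: 1-D array of Q(s, a) estimates for the current state.
    """
    rng = rng if rng is not None else np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: a uniformly random action
    return int(np.argmax(q_values))              # exploit: the current best action
```

In practice, $\epsilon$ is decayed over time, e.g. multiplied by a factor slightly below 1 after each episode, so the policy gradually becomes greedy.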

Monte Carlo Control

The process of Monte Carlo Control is as follows:
$\pi_0\xrightarrow{E} q_0\xrightarrow{I} \pi_1\xrightarrow{E} q_1\xrightarrow{I} \pi_2\rightarrow\cdots\xrightarrow{E} q^*\xrightarrow{I} \pi^*$
As with Value Iteration, we can keep the policy implicit and store only the action values, which gives a value-iteration-style version of Monte Carlo Control; at the end of this algorithm, we generate the policy from the Q-values.
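
A compact sketch of on-policy first-visit MC control with an $\epsilon$-greedy policy, again assuming a hypothetical `generate_episode(policy_fn)` helper that returns `(state, action, reward)` triples:

```python
from collections import defaultdict
import numpy as np

def mc_control(generate_episode, n_actions, n_episodes=50_000, gamma=0.9, epsilon=0.1):
    """On-policy first-visit MC control with an epsilon-greedy behaviour policy."""
    Q = defaultdict(lambda: np.zeros(n_actions))
    counts = defaultdict(lambda: np.zeros(n_actions))
    rng = np.random.default_rng()

    def policy_fn(s):
        # epsilon-greedy with respect to the current Q estimate
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        episode = generate_episode(policy_fn)        # list of (s, a, r) triples
        # Discounted return following each step, computed backwards.
        g, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            g = episode[t][2] + gamma * g
            returns[t] = g
        # Update each (s, a) pair at its first visit only, via an incremental mean.
        first = {}
        for t, (s, a, _) in enumerate(episode):
            first.setdefault((s, a), t)
        for (s, a), t in first.items():
            counts[s][a] += 1
            Q[s][a] += (returns[t] - Q[s][a]) / counts[s][a]

    # Extract the final greedy policy from the learned Q-values.
    return {s: int(np.argmax(q)) for s, q in Q.items()}, Q
```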

Pros and Cons
  • pros
    • learns from sampled experience rather than a full model of the environment
  • cons
    • works on episodic tasks only
Temporal-Difference
TD Prediction

Consider the Bellman equation for the state value
$V_\pi(s_t)=\mathbb{E}_\pi\left[R(s_{t+1})+\gamma V_\pi(s_{t+1})\mid a_t=\pi(s_t)\right]$
When the policy $\pi$ is fixed, a single sampled transition gives the estimate
$V(s_t)\approx R(s_{t+1})+\gamma V(s_{t+1})$
The gap between this target and the current estimate is the TD error
$td\_error=\left|R(s_{t+1})+\gamma V_\pi(s_{t+1})-V_\pi(s_t)\right|$
In TD prediction, each observed transition nudges the value estimate toward the target, $V(s_t)\leftarrow V(s_t)+\alpha\left[R(s_{t+1})+\gamma V(s_{t+1})-V(s_t)\right]$, which drives the TD error toward zero. To optimize the policy itself, we modify $\pi$ to minimize the TD error as follows
$\pi^*=\arg\min_\pi\left|R(s_{t+1})+\gamma V_\pi(s_{t+1})-V_\pi(s_t)\right|$
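
A minimal sketch of the TD(0) update for a single observed transition (the table `V`, the function name, and the parameters are ours, for illustration):

```python
def td0_update(V, s, r_next, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) prediction step for an observed transition (s, r_next, s_next).

    V is a dict-like table of state-value estimates, updated in place.
    """
    td_target = r_next + gamma * V[s_next]
    td_error = td_target - V[s]
    V[s] = V[s] + alpha * td_error
    return td_error
```

This update is applied after every environment step while following the fixed policy; SARSA and Q-Learning apply the same idea to action values.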

N-step TD

Consider the $n$-step expansion of the state value
$V(s_t)=R(s_{t+1})+\gamma R(s_{t+2})+\cdots+\gamma^{n-1} R(s_{t+n})+\gamma^{n} V(s_{t+n})$
Similarly, in the TD algorithm we can form the target and the TD error using any number of steps $n$. If we let $n\rightarrow\infty$, the method degenerates to the MC algorithm, so $n$ needs to be tuned for good performance. To reduce the effect of the step count on the scale of the results, we can multiply $V(s)$ by $1-\gamma$, so that the expected values stay on the same order of magnitude across different values of the hyper-parameter $\gamma$.
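
A sketch of one $n$-step TD update, assuming the trajectory so far is kept in simple `states`/`rewards` buffers indexed by time step (our bookkeeping, not the original's), so the update for time $t$ can only be applied once step $t+n$ has been observed:

```python
def n_step_td_update(V, states, rewards, t, n, alpha=0.1, gamma=0.9):
    """One n-step TD update for time step t.

    states[k] is s_k; rewards[k] is the reward received on arriving at s_k,
    with entries up to index t + n already observed.
    """
    # n-step return: R_{t+1} + gamma R_{t+2} + ... + gamma^{n-1} R_{t+n} + gamma^n V(s_{t+n})
    g = sum(gamma ** (k - t - 1) * rewards[k] for k in range(t + 1, t + n + 1))
    g += gamma ** n * V[states[t + n]]
    V[states[t]] += alpha * (g - V[states[t]])
    return g
```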

Pros and Cons
  • pros
    • more flexible than MC
    • applicable in both on-policy (SARSA) and off-policy (Q-Learning) settings
    • TD methods perform well in practice, and most state-of-the-art algorithms are built on them