
Reinforcement learning

offline planning:
the agent has full knowledge of both the transition function and the reward function

online planning:
the agent has no prior knowledge of the transition or reward function
it must explore and receives feedback (successor states and rewards)

sample:
(s, a, s', r)

episode:
a sequence of samples that ends in a terminal state

Categories of reinforcement learning

model-based learning
model-based learning attempts to estimate the transition and reward functions and then uses them to solve the MDP

model-free learning
attempts to estimate the values or Q-values directly, without constructing the reward or transition function

Model-based learning

$\hat T(s, a, s')$: count the occurrences of $(s, a, s')$ and normalize by the number of times $(s, a)$ was taken
By the law of large numbers, $\hat T$ converges to the true transition function, and each $\hat R(s, a, s')$ is discovered when the corresponding transition is experienced.
After sufficient exploration, the estimated MDP can be solved with standard MDP methods.
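As a rough illustration, the sketch below estimates $\hat T$ and $\hat R$ from a list of $(s, a, s', r)$ samples by counting and normalizing; the states, actions, and sample data are made-up placeholders.

```python
from collections import defaultdict

def estimate_model(samples):
    """Estimate T_hat(s, a, s') and R_hat(s, a, s') by counting and normalizing."""
    counts = defaultdict(int)          # number of times (s, a, s') was observed
    totals = defaultdict(int)          # number of times (s, a) was taken
    reward_sums = defaultdict(float)   # summed reward for each (s, a, s')

    for s, a, s_next, r in samples:
        counts[(s, a, s_next)] += 1
        totals[(s, a)] += 1
        reward_sums[(s, a, s_next)] += r

    T_hat = {k: n / totals[(k[0], k[1])] for k, n in counts.items()}
    R_hat = {k: reward_sums[k] / n for k, n in counts.items()}
    return T_hat, R_hat

# Hypothetical samples gathered while following some policy
samples = [('A', 'right', 'B', -1), ('A', 'right', 'B', -1),
           ('A', 'right', 'C', -1), ('B', 'exit', 'done', 10)]
T_hat, R_hat = estimate_model(samples)
print(T_hat[('A', 'right', 'B')])   # 2/3: two of the three (A, right) samples ended in B
```

Once $\hat T$ and $\hat R$ are estimated, the resulting MDP can be handed to value iteration or policy iteration.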

Model-free learning

passive reinforcement learning: policy evaluation
given a policy, follow it and learn the state values under it

active reinforcement learning: policy control
use feedback to iteratively update the policy until the optimal policy is determined after sufficient exploration

direct evaluation (passive RL)

follow the given policy and estimate each state's value as the total observed utility divided by the number of times the state was visited
advantages:
easy to understand
converges given enough samples

disadvantages:
slow, wastes the information contained in transitions between states
each state is learned separately

goal: compute the value of each state under $\pi$
idea: value = mean return
distinguish:

  1. first-visit MC: update only once per episode, at the first visit to the state
  2. every-visit MC: update at every visit (we can use a running mean)
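A minimal sketch of direct evaluation as Monte Carlo policy evaluation, covering both the first-visit and every-visit variants; it assumes each episode is given as a list of (state, reward) pairs collected while following the fixed policy.

```python
from collections import defaultdict

def mc_evaluate(episodes, gamma=1.0, first_visit=True):
    """Estimate V(s) as the mean return observed from s under the followed policy."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    for episode in episodes:
        # Walk the episode backwards to compute the return G_t from each step,
        # then restore forward order.
        G, returns = 0.0, []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()

        seen = set()
        for state, G in returns:
            if first_visit and state in seen:
                continue   # first-visit MC: only the first visit per episode counts
            seen.add(state)
            returns_sum[state] += G
            returns_count[state] += 1

    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

# Hypothetical episodes: (state, reward received after leaving that state) pairs
episodes = [[('A', -1), ('B', -1), ('C', 10)], [('A', -1), ('C', 10)]]
print(mc_evaluate(episodes))   # mean return per state, e.g. V(A) = (8 + 9) / 2
```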

transition based policy evaluation


temporal difference learning (passive RL)

learn from every experience (every transition)
Bellman equation:
$V^{\pi}(s)=\sum_{s'} T(s, \pi(s), s')\left[R(s, \pi(s), s')+\gamma V^{\pi}(s')\right]$
how can we evaluate this without knowing the weights $T(s, \pi(s), s')$?
TD learning solves the problem with an exponential moving average of samples

$sample = R(s, \pi(s), s') + \gamma V^{\pi}(s')$
update: $V^{\pi}(s) \leftarrow (1-\alpha)V^{\pi}(s) + \alpha \cdot sample$

learning rate: $\alpha$
usually start with $\alpha = 1$ and gradually decrease it toward $\alpha = 0$

older samples are given exponentially less weight

advantages:

  1. learns at every timestep
  2. gives old samples less weight
  3. converges much more quickly

TD error: $\delta_t = r_t + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t)$
TD target: $r_t + \gamma V^{\pi}(s_{t+1})$
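A small sketch of the TD(0) update above, assuming experience arrives as (s, r, s', done) transitions gathered while following $\pi$; the states, rewards, and step sizes are placeholders.

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.9):
    """One TD(0) step: V(s) <- (1 - alpha) * V(s) + alpha * sample."""
    sample = r + (0.0 if done else gamma * V[s_next])   # TD target
    td_error = sample - V[s]                            # delta_t
    V[s] += alpha * td_error                            # same as the (1 - alpha) / alpha mix
    return td_error

V = defaultdict(float)
# Hypothetical transitions observed while following the fixed policy pi
for s, r, s_next, done in [('A', -1, 'B', False), ('B', 10, 'exit', True)]:
    td0_update(V, s, r, s_next, done)
```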

Q-learning (off-policy learning)


direct evaluation and TD learning produce state values; to extract a policy from them we would still need the transition and reward functions to compute Q-values, so Q-learning learns Q-values directly instead

Q-value iteration:
$Q_{k+1}(s, a) \leftarrow \sum_{s'} T(s, a, s')\left[R(s, a, s')+\gamma \max_{a'} Q_{k}(s', a')\right]$

$sample = R(s, a, s') + \gamma \max_{a'} Q(s', a')$
$Q(s, a) \leftarrow (1-\alpha)Q(s, a) + \alpha \cdot sample = Q(s, a) + \alpha \cdot difference$
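A tabular Q-learning update along these lines might look like the sketch below; the action set and the sample transition are hypothetical.

```python
from collections import defaultdict

def q_update(Q, actions, s, a, r, s_next, done, alpha=0.5, gamma=0.9):
    """Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * [r + gamma * max_a' Q(s', a')]."""
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    sample = r + gamma * best_next
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
    return Q[(s, a)]

Q = defaultdict(float)
actions = ['left', 'right']
q_update(Q, actions, 'A', 'right', -1, 'B', False)   # one hypothetical sample
```

Because the target uses $\max_{a'} Q(s', a')$ rather than the action actually taken next, the samples can come from any policy, which is what makes Q-learning off-policy.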

policy control

estimate a new policy from information gathered while following another (behavior) policy
Q-learning is an example of this (off-policy learning)

exploration and exploitation

the agent must distribute its time between exploration and exploitation

$\epsilon$-greedy policies

with probability $\epsilon$: act randomly and explore
with probability $1-\epsilon$: follow the current policy and exploit
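A one-function sketch of an $\epsilon$-greedy action choice over a tabular Q (the Q table and action list mirror the earlier hypothetical sketches).

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore randomly; otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.choice(actions)                    # explore
    return max(actions, key=lambda a: Q[(state, a)])     # exploit

Q = defaultdict(float)
action = epsilon_greedy(Q, 'A', ['left', 'right'], epsilon=0.2)
```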

exploration function

avoids having to manually tune the size of $\epsilon$

$Q(s, a) \leftarrow (1-\alpha)Q(s, a) + \alpha\left[R(s, a, s') + \gamma \max_{a'} f(s', a')\right]$

$f(s, a) = Q(s, a) + \frac{k}{N(s, a)}$

$N(s, a)$: the number of times $(s, a)$ has been visited
$k$: a predetermined constant
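Plugging the exploration function into the Q-learning target could look like the sketch below; the +1 in the denominator is an added assumption to avoid dividing by zero for unvisited pairs, and k is just a tunable constant.

```python
from collections import defaultdict

Q = defaultdict(float)
N = defaultdict(int)   # visit counts N(s, a)
k = 1.0                # predetermined exploration constant

def f(s, a):
    """Optimistic estimate: rarely visited actions receive a bonus k / N(s, a)."""
    return Q[(s, a)] + k / (N[(s, a)] + 1)   # +1 is an assumption to avoid division by zero

def q_update_with_exploration(actions, s, a, r, s_next, done, alpha=0.5, gamma=0.9):
    """Q-learning update that uses f instead of Q when evaluating the next state."""
    N[(s, a)] += 1
    best_next = 0.0 if done else max(f(s_next, a2) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

q_update_with_exploration(['left', 'right'], 'A', 'right', -1, 'B', False)   # hypothetical sample
```

As $N(s, a)$ grows, the bonus shrinks and $f(s, a)$ approaches $Q(s, a)$, so exploration fades out on its own.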

approximate Q-learning

for when we cannot store every Q-value:
keeping a table of all the values and Q-values
requires too much storage and too much experience
instead, learn about a few general situations and extrapolate to many similar situations

what can be approximated: the transition model, rewards, values, the policy, or Q-values (p/r/v/$\pi$/q)
parameters are updated by minimizing the mean squared error
feature-based representation of states: a feature vector
linear value functions:
MC: $difference = G_t - x(s_t)^{\top} w$
TD: $difference = r_{t+1} + \gamma\, x(s_{t+1})^{\top} w - x(s_t)^{\top} w$
Q-learning: $difference = \left[R(s, a, s') + \gamma \max_{a'} Q(s', a')\right] - Q(s, a)$
update rule:
$w_i \leftarrow w_i + \alpha \cdot difference \cdot f_i(s, a)$
the update is step size × prediction error × feature value
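A sketch of approximate Q-learning with a linear Q-function $Q(s, a) = w^{\top} f(s, a)$; the two-dimensional feature extractor at the end is purely illustrative.

```python
def q_value(w, features):
    """Linear Q-value: dot product of the weight vector and the feature vector."""
    return sum(w_i * f_i for w_i, f_i in zip(w, features))

def approx_q_update(w, feat, s, a, r, s_next, done, actions, alpha=0.1, gamma=0.9):
    """w_i <- w_i + alpha * difference * f_i(s, a)."""
    q_sa = q_value(w, feat(s, a))
    best_next = 0.0 if done else max(q_value(w, feat(s_next, a2)) for a2 in actions)
    difference = (r + gamma * best_next) - q_sa
    return [w_i + alpha * difference * f_i for w_i, f_i in zip(w, feat(s, a))]

# Purely illustrative two-dimensional feature extractor
feat = lambda s, a: [1.0, float(s == 'goal')]
w = [0.0, 0.0]
w = approx_q_update(w, feat, 'A', 'right', -1, 'B', False, ['left', 'right'])
```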

Questions

For Q-learning to converge, every action must be explored sufficiently often. A purely greedy algorithm always takes the currently best action and never explores the non-optimal ones, and a fixed policy likewise does not explore the full state space.
In TD learning, multiplying all rewards by a positive constant does not change the optimal policy.
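A short sketch of why the last point holds: scaling every reward by a constant $c > 0$ scales every Q-value by $c$, which leaves the greedy action choice unchanged.

$Q'(s, a) = \sum_{s'} T(s, a, s')\left[c\,R(s, a, s') + \gamma \max_{a'} Q'(s', a')\right]$ is satisfied by $Q'(s, a) = c\,Q(s, a)$, and $\arg\max_a c\,Q(s, a) = \arg\max_a Q(s, a)$ for $c > 0$.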
