Reinforcement learning
offline planning:
the agent has full knowledge of both the transition function and the reward function
online planning:
the agent has no prior knowledge of the transition function or the reward function
it must explore and receives feedback (successor states and rewards)
sample:
(s,a,s’,r)
episode:
a sequence of samples that ends in a terminal state
Types of reinforcement learning
model-based learning
attempts to estimate the transition function and reward function, then uses them to solve the MDP
model-free learning
attempts to estimate values or Q-values directly, without constructing the reward function or the transition function
Model-based learning
$\hat T(s,a,s')$: count occurrences of $(s,a,s')$ and normalize by the number of times $(s,a)$ was taken
By the law of large numbers, $\hat T$ converges to the true transition function and $\hat R$ is recovered after sufficient exploration; the estimated MDP can then be solved with the usual methods (e.g. value iteration).
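As a concrete illustration, here is a minimal sketch of how those counts turn into $\hat T$ and $\hat R$; the helper name `estimate_model` and the sample list are hypothetical, assuming samples are stored as `(s, a, s', r)` tuples as defined above.

```python
from collections import defaultdict

def estimate_model(samples):
    """Estimate T_hat(s, a, s') and R_hat(s, a, s') from (s, a, s', r) samples."""
    counts = defaultdict(int)        # times (s, a, s') was observed
    totals = defaultdict(int)        # times (s, a) was taken
    reward_sum = defaultdict(float)  # sum of rewards observed for (s, a, s')

    for s, a, s_next, r in samples:
        counts[(s, a, s_next)] += 1
        totals[(s, a)] += 1
        reward_sum[(s, a, s_next)] += r

    # normalize the count of (s, a, s') by the count of (s, a)
    T_hat = {k: c / totals[(k[0], k[1])] for k, c in counts.items()}
    R_hat = {k: reward_sum[k] / c for k, c in counts.items()}
    return T_hat, R_hat

# samples gathered while exploring a tiny hypothetical MDP
samples = [("A", "go", "B", 1.0), ("A", "go", "B", 1.0), ("A", "go", "C", 0.0)]
T_hat, R_hat = estimate_model(samples)
print(T_hat)  # {('A', 'go', 'B'): 0.666..., ('A', 'go', 'C'): 0.333...}
print(R_hat)  # {('A', 'go', 'B'): 1.0, ('A', 'go', 'C'): 0.0}
```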
Model-free learning
passive reinforcement learning: policy evaluation
given a policy, follow it and learn the state values under it
active reinforcement learning: policy control
use feedback to iteratively update the policy until the optimal policy is determined, after sufficient exploration
direct evaluation (passive RL)
given a policy, follow it; estimate each state's value as the total observed utility divided by the number of visits
Advantages:
easy to understand
converges given enough samples
Disadvantages:
slow, wastes information about transitions between states
each state is learned separately
goal: compute the value of each state under $\pi$
idea: value = mean return
Distinguish:
- first-visit MC: update only once per episode, at the first visit to a state
- every-visit MC: update at every visit (we can use a running mean), as sketched below
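A minimal Monte Carlo policy-evaluation sketch, assuming each episode is recorded as a list of `(state, reward)` pairs; the function name `mc_evaluate` and the episode format are illustrative choices, not from the notes.

```python
from collections import defaultdict

def mc_evaluate(episodes, gamma=0.9, first_visit=True):
    """Monte Carlo policy evaluation: V(s) = mean return observed from s."""
    V = defaultdict(float)
    visits = defaultdict(int)

    for episode in episodes:
        # compute the return G_t at every timestep by scanning backwards
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()

        seen = set()
        for state, G in returns:
            if first_visit and state in seen:
                continue  # first-visit MC: only the first occurrence per episode counts
            seen.add(state)
            visits[state] += 1
            V[state] += (G - V[state]) / visits[state]  # running mean of returns
    return dict(V)

episodes = [[("A", 0.0), ("B", 1.0)], [("A", 0.0), ("B", 0.0), ("B", 1.0)]]
print(mc_evaluate(episodes))
```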
transition-based policy evaluation
temporal difference learning (passive RL)
learn from every experience
Bellman equation:
$$V^{\pi}(s)=\sum_{s'} T(s, \pi(s), s')\left[R(s, \pi(s), s')+\gamma V^{\pi}(s')\right]$$
How do we compute the Bellman equation without knowing the weights $T(s, \pi(s), s')$?
TD solves this with an exponential moving average of samples.
$$\text{sample}=R(s,\pi(s),s')+\gamma V^{\pi}(s')$$
update:
$$V^{\pi}(s)=(1-\alpha)V^{\pi}(s)+\alpha\cdot\text{sample}$$
learning rate $\alpha$: typically starts at $\alpha=1$ and decays gradually toward $\alpha=0$, so older samples are given exponentially less weight
Advantages:
- learns at every timestep
- gives old samples less weight
- converges much more quickly
TD error:
$$\delta_t=r_t+\gamma V^{\pi}(s_{t+1})-V^{\pi}(s_t)$$
TD target:
$$r_t+\gamma V^{\pi}(s_{t+1})$$
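A small sketch of a single TD(0) update using the sample, TD target, and TD error above; the dictionary-based `V` and the function name `td_update` are illustrative assumptions.

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: move V(s) toward the sample r + gamma * V(s')."""
    sample = r + gamma * V.get(s_next, 0.0)   # TD target
    td_error = sample - V.get(s, 0.0)         # delta_t
    V[s] = V.get(s, 0.0) + alpha * td_error   # same as (1 - alpha) * V(s) + alpha * sample
    return V

V = {}
# transitions observed while following a fixed policy pi: (s, r, s')
for s, r, s_next in [("A", 0.0, "B"), ("B", 1.0, "T"), ("A", 0.0, "B")]:
    td_update(V, s, r, s_next)
print(V)
```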
Q-learning (off-policy learning)
direct evaluation and TD learning only learn state values; extracting a policy from those values still requires some knowledge of the transition and reward functions, whereas Q-learning learns Q-values directly
Q-value iteration:
$$Q_{k+1}(s, a) \leftarrow \sum_{s'} T(s, a, s')\left[R(s, a, s')+\gamma \max_{a'} Q_{k}(s', a')\right]$$
$$\text{sample}=R(s,a,s')+\gamma \max_{a'}Q(s',a')$$
$$Q(s,a)=(1-\alpha)Q(s,a)+\alpha\cdot\text{sample}=Q(s,a)+\alpha\cdot\text{difference}$$
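A sketch of one tabular Q-learning update from a single `(s, a, s', r)` sample, following the sample/difference form above; the dictionary `Q` keyed by `(s, a)` pairs and the function name are assumptions for illustration.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One Q-learning step from a single sample (s, a, s', r)."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    sample = r + gamma * best_next                        # off-policy target
    difference = sample - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * difference   # (1 - alpha) * Q + alpha * sample
    return Q

Q = {}
actions = ["left", "right"]
q_learning_update(Q, "A", "right", 1.0, "B", actions)
print(Q)  # {('A', 'right'): 0.5}
```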
policy control
estimate and improve a policy using experience gathered while following a different policy; Q-learning works this way, which is why it is called off-policy
exploration and exploitation
distributing time between exploration and exploitation
$\epsilon$-greedy policies
with probability $\epsilon$: act randomly and explore
with probability $1-\epsilon$: follow the current policy and exploit
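A possible action-selection sketch for an $\epsilon$-greedy policy; `epsilon_greedy` and the `Q` dictionary format are assumed here, not part of the notes.

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon explore (random action), otherwise exploit (greedy action)."""
    if random.random() < epsilon:
        return random.choice(actions)                           # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))       # exploit

Q = {("A", "right"): 0.5}
print(epsilon_greedy(Q, "A", ["left", "right"], epsilon=0.1))
```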
exploration function
avoids having to manually tune the size of $\epsilon$
$$Q(s,a)\leftarrow (1-\alpha)Q(s,a)+\alpha\left[R(s,a,s')+\gamma \max_{a'}f(s',a')\right]$$
$$f(s,a)=Q(s,a)+\frac{k}{N(s,a)}$$
$N(s,a)$: the number of times the state-action pair $(s,a)$ has been visited
$k$: a predetermined constant
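A small sketch of the exploration function $f(s,a)=Q(s,a)+k/N(s,a)$; treating unvisited pairs as infinitely attractive is one common choice, and the names here are illustrative.

```python
def exploration_value(Q, N, s, a, k=1.0):
    """Optimistic value f(s,a) = Q(s,a) + k / N(s,a); rarely tried pairs look attractive."""
    n = N.get((s, a), 0)
    if n == 0:
        return float("inf")          # never tried: maximally optimistic
    return Q.get((s, a), 0.0) + k / n

Q, N = {("A", "right"): 0.5}, {("A", "right"): 3}
print(exploration_value(Q, N, "A", "right"))  # 0.5 + 1/3
print(exploration_value(Q, N, "A", "left"))   # inf, so "left" gets tried next
```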
approximate Q-learning
for cases where we cannot store every Q-value:
keeping a table of all the values and Q-values requires too much storage and too much experience
learn about a few general situations and extrapolate to many similar situations
p/r/v/pi/q
mean-squared-error update
feature-based representation of states: feature vector
linear value functions: $V(s)=\mathbf{x}(s)^\top\mathbf{w}$, $Q(s,a)=\mathbf{f}(s,a)^\top\mathbf{w}$
MC:
$$\text{difference}=G_t-\mathbf{x}_t^\top\mathbf{w}$$
TD:
$$\text{difference}=r+\gamma\,\mathbf{x}(s')^\top\mathbf{w}-\mathbf{x}(s)^\top\mathbf{w}$$
Q-learning:
$$\text{difference}=\left[R(s,a,s')+\gamma \max_{a'}Q(s',a')\right]-Q(s,a)$$
update rule:
$$w_i \leftarrow w_i+\alpha\cdot\text{difference}\cdot f_i(s,a)$$
the update is the step size times the prediction error times the feature value
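A sketch of the approximate Q-learning weight update with a linear feature representation; the two-dimensional `features` function and the sample values are made up for illustration.

```python
def approx_q_update(w, features, s, a, r, s_next, actions, alpha=0.05, gamma=0.9):
    """Approximate Q-learning with Q(s,a) = w . f(s,a):
    each weight moves by alpha * difference * f_i(s,a)."""
    def q(state, action):
        return sum(wi * fi for wi, fi in zip(w, features(state, action)))

    difference = (r + gamma * max(q(s_next, a2) for a2 in actions)) - q(s, a)
    f = features(s, a)
    return [wi + alpha * difference * fi for wi, fi in zip(w, f)]

# hypothetical 2-dimensional feature vector
def features(state, action):
    return [1.0, 1.0 if action == "right" else 0.0]

w = [0.0, 0.0]
w = approx_q_update(w, features, "A", "right", 1.0, "B", ["left", "right"])
print(w)  # each weight nudged toward the observed reward signal
```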
Issues
For Q-learning to converge, every action must be explored sufficiently often; a purely greedy policy always takes the current best action and never explores suboptimal ones, while a fixed policy does not explore the state space fully.
TD learning: multiplying all rewards by a positive constant does not change the optimal policy.