Reinforcement learning
offline planning:
the agent has full knowledge of both the transition function and the reward function
online planning:
the agent has no prior knowledge of the transition function or the reward function
it must explore and receives feedback (successor states and rewards)
sample:
(s,a,s’,r)
episode:
a sequence of samples that ends in a terminal state
Types of reinforcement learning
model-based learning
attempts to estimate the transition function and reward function, then uses them to solve the MDP
model-free learning
attempts to estimate values or Q-values directly, without constructing the reward function or the transition function
Model-based learning
$\hat T(s,a,s')$: count occurrences of $(s,a,s')$ and normalize by the number of times $(s,a)$ was taken
By the law of large numbers, $\hat T$ converges to the true transition function and $\hat R$ is recovered after sufficient exploration; the estimated MDP can then be solved with the usual methods (e.g. value iteration).
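As a concrete illustration, here is a minimal sketch of how those counts turn into $\hat T$ and $\hat R$; the helper name `estimate_model` and the sample list are hypothetical, assuming samples are stored as `(s, a, s', r)` tuples as defined above.

```python
from collections import defaultdict

def estimate_model(samples):
    """Estimate T_hat(s, a, s') and R_hat(s, a, s') from (s, a, s', r) samples."""
    counts = defaultdict(int)        # times (s, a, s') was observed
    totals = defaultdict(int)        # times (s, a) was taken
    reward_sum = defaultdict(float)  # sum of rewards observed for (s, a, s')

    for s, a, s_next, r in samples:
        counts[(s, a, s_next)] += 1
        totals[(s, a)] += 1
        reward_sum[(s, a, s_next)] += r

    # normalize the count of (s, a, s') by the count of (s, a)
    T_hat = {k: c / totals[(k[0], k[1])] for k, c in counts.items()}
    R_hat = {k: reward_sum[k] / c for k, c in counts.items()}
    return T_hat, R_hat

# samples gathered while exploring a tiny hypothetical MDP
samples = [("A", "go", "B", 1.0), ("A", "go", "B", 1.0), ("A", "go", "C", 0.0)]
T_hat, R_hat = estimate_model(samples)
print(T_hat)  # {('A', 'go', 'B'): 0.666..., ('A', 'go', 'C'): 0.333...}
print(R_hat)  # {('A', 'go', 'B'): 1.0, ('A', 'go', 'C'): 0.0}
```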
Model-free learning
passive reinforcement learning: policy evaluation
given a policy, follow it and learn the state values under it
active reinforcement learning: policy control
use feedback to iteratively update the policy until the optimal policy is determined, after sufficient exploration
direct evaluation (passive RL)
given a policy, follow it; estimate each state's value as the total observed utility divided by the number of visits
Advantages:
easy to understand
converges given enough samples
Disadvantages:
slow, wastes information about transitions between states
each state is learned separately
goal: compute the value of each state under $\pi$
idea: value = mean return
Distinguish:
- first-visit MC: update only once per episode, at the first visit to a state
- every-visit MC: update at every visit (we can use a running mean), as sketched below
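A minimal Monte Carlo policy-evaluation sketch, assuming each episode is recorded as a list of `(state, reward)` pairs; the function name `mc_evaluate` and the episode format are illustrative choices, not from the notes.

```python
from collections import defaultdict

def mc_evaluate(episodes, gamma=0.9, first_visit=True):
    """Monte Carlo policy evaluation: V(s) = mean return observed from s."""
    V = defaultdict(float)
    visits = defaultdict(int)

    for episode in episodes:
        # compute the return G_t at every timestep by scanning backwards
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()

        seen = set()
        for state, G in returns:
            if first_visit and state in seen:
                continue  # first-visit MC: only the first occurrence per episode counts
            seen.add(state)
            visits[state] += 1
            V[state] += (G - V[state]) / visits[state]  # running mean of returns
    return dict(V)

episodes = [[("A", 0.0), ("B", 1.0)], [("A", 0.0), ("B", 0.0), ("B", 1.0)]]
print(mc_evaluate(episodes))
```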
transition-based policy evaluation
temporal difference learning (passive RL)
learn from every experience
Bellman equation:
$$V^{\pi}(s)=\sum_{s'} T(s, \pi(s), s')\left[R(s, \pi(s), s')+\gamma V^{\pi}(s')\right]$$
How do we compute the Bellman equation without knowing the weights $T(s, \pi(s), s')$?
TD solves this with an exponential moving average of samples.
$$\text{sample}=R(s,\pi(s),s')+\gamma V^{\pi}(s')$$
update:
$$V^{\pi}(s)=(1-\alpha)V^{\pi}(s)+\alpha\cdot\text{sample}$$
learning rate $\alpha$: typically starts at $\alpha=1$ and decays gradually toward $\alpha=0$, so older samples are given exponentially less weight
Advantages:
- learns at every timestep
- gives old samples less weight
- converges much more quickly
TD error:
$$\delta_t=r_t+\gamma V^{\pi}(s_{t+1})-V^{\pi}(s_t)$$
TD target:
$$r_t+\gamma V^{\pi}(s_{t+1})$$
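A small sketch of a single TD(0) update using the sample, TD target, and TD error above; the dictionary-based `V` and the function name `td_update` are illustrative assumptions.

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: move V(s) toward the sample r + gamma * V(s')."""
    sample = r + gamma * V.get(s_next, 0.0)   # TD target
    td_error = sample - V.get(s, 0.0)         # delta_t
    V[s] = V.get(s, 0.0) + alpha * td_error   # same as (1 - alpha) * V(s) + alpha * sample
    return V

V = {}
# transitions observed while following a fixed policy pi: (s, r, s')
for s, r, s_next in [("A", 0.0, "B"), ("B", 1.0, "T"), ("A", 0.0, "B")]:
    td_update(V, s, r, s_next)
print(V)
```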
Q-learning (off-policy learning)
direct evaluation and TD learning only learn state values; extracting a policy from those values still requires some knowledge of the transition and reward functions, whereas Q-learning learns Q-values directly
Q-value iteration:
$$Q_{k+1}(s, a) \leftarrow \sum_{s'} T(s, a, s')\left[R(s, a, s')+\gamma \max_{a'} Q_{k}(s', a')\right]$$
$$\text{sample}=R(s,a,s')+\gamma \max_{a'}Q(s',a')$$
$$Q(s,a)=(1-\alpha)Q(s,a)+\alpha\cdot\text{sample}=Q(s,a)+\alpha\cdot\text{difference}$$
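A sketch of one tabular Q-learning update from a single `(s, a, s', r)` sample, following the sample/difference form above; the dictionary `Q` keyed by `(s, a)` pairs and the function name are assumptions for illustration.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One Q-learning step from a single sample (s, a, s', r)."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    sample = r + gamma * best_next                        # off-policy target
    difference = sample - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * difference   # (1 - alpha) * Q + alpha * sample
    return Q

Q = {}
actions = ["left", "right"]
q_learning_update(Q, "A", "right", 1.0, "B", actions)
print(Q)  # {('A', 'right'): 0.5}
```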
policy control
estimate and improve a policy using experience gathered while following a different policy; Q-learning works this way, which is why it is called off-policy
exploration and exploitation
distributing time between exploration and exploitation
$\epsilon$-greedy policies
with probability $\epsilon$: act randomly and explore
with probability $1-\epsilon$: follow the current policy and exploit
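A possible action-selection sketch for an $\epsilon$-greedy policy; `epsilon_greedy` and the `Q` dictionary format are assumed here, not part of the notes.

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon explore (random action), otherwise exploit (greedy action)."""
    if random.random() < epsilon:
        return random.choice(actions)                           # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))       # exploit

Q = {("A", "right"): 0.5}
print(epsilon_greedy(Q, "A", ["left", "right"], epsilon=0.1))
```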
exploration function
avoids having to manually tune the size of $\epsilon$
$$Q(s,a)\leftarrow (1-\alpha)Q(s,a)+\alpha\left[R(s,a,s')+\gamma \max_{a'}f(s',a')\right]$$
$$f(s,a)=Q(s,a)+\frac{k}{N(s,a)}$$
$N(s,a)$: the number of times the state-action pair $(s,a)$ has been visited
$k$: a predetermined constant
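A small sketch of the exploration function $f(s,a)=Q(s,a)+k/N(s,a)$; treating unvisited pairs as infinitely attractive is one common choice, and the names here are illustrative.

```python
def exploration_value(Q, N, s, a, k=1.0):
    """Optimistic value f(s,a) = Q(s,a) + k / N(s,a); rarely tried pairs look attractive."""
    n = N.get((s, a), 0)
    if n == 0:
        return float("inf")          # never tried: maximally optimistic
    return Q.get((s, a), 0.0) + k / n

Q, N = {("A", "right"): 0.5}, {("A", "right"): 3}
print(exploration_value(Q, N, "A", "right"))  # 0.5 + 1/3
print(exploration_value(Q, N, "A", "left"))   # inf, so "left" gets tried next
```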
approximate Q-learning
for cases where we cannot store every Q-value:
keeping a table of all the values and Q-values requires too much storage and too much experience
learn about a few general situations and extrapolate to many similar situations
p/r/v/pi/q
mean-squared-error update
feature-based representation of states: feature vector
linear value functions: $V(s)=\mathbf{x}(s)^\top\mathbf{w}$, $Q(s,a)=\mathbf{f}(s,a)^\top\mathbf{w}$
MC:
$$\text{difference}=G_t-\mathbf{x}_t^\top\mathbf{w}$$
TD:
$$\text{difference}=r+\gamma\,\mathbf{x}(s')^\top\mathbf{w}-\mathbf{x}(s)^\top\mathbf{w}$$
Q-learning:
$$\text{difference}=\left[R(s,a,s')+\gamma \max_{a'}Q(s',a')\right]-Q(s,a)$$
update rule:
$$w_i \leftarrow w_i+\alpha\cdot\text{difference}\cdot f_i(s,a)$$
the update is the step size times the prediction error times the feature value
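A sketch of the approximate Q-learning weight update with a linear feature representation; the two-dimensional `features` function and the sample values are made up for illustration.

```python
def approx_q_update(w, features, s, a, r, s_next, actions, alpha=0.05, gamma=0.9):
    """Approximate Q-learning with Q(s,a) = w . f(s,a):
    each weight moves by alpha * difference * f_i(s,a)."""
    def q(state, action):
        return sum(wi * fi for wi, fi in zip(w, features(state, action)))

    difference = (r + gamma * max(q(s_next, a2) for a2 in actions)) - q(s, a)
    f = features(s, a)
    return [wi + alpha * difference * fi for wi, fi in zip(w, f)]

# hypothetical 2-dimensional feature vector
def features(state, action):
    return [1.0, 1.0 if action == "right" else 0.0]

w = [0.0, 0.0]
w = approx_q_update(w, features, "A", "right", 1.0, "B", ["left", "right"])
print(w)  # each weight nudged toward the observed reward signal
```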
Issues
For Q-learning to converge, every action must be explored sufficiently often; a purely greedy policy always takes the current best action and never explores suboptimal ones, while a fixed policy does not explore the state space fully.
TD learning: multiplying all rewards by a positive constant does not change the optimal policy.