Note: These notes organize the basic concepts of our school's CS181 course (the slides borrow from Berkeley CS188). Since lectures and exams are in English, English appears throughout.
1 Reinforcement Learning
1.1 Online setting
Def Online MDP: a Markov decision process whose transition and reward functions are unknown (only partially observed through the transitions the agent actually experiences).
Recall Markov decision process (MDP) elements:
- A set of states s ∈ S and a set of actions a ∈ A
- A transition model T(s, a, s')
- A reward function R(s, a, s')
- (Possibly) a start state and a discount factor γ
Implementation: reinforcement learning
1.2 Model-based learning
Step 1: Learn empirical MDP model
- Count outcomes s' for each s, a
- Normalize to give an estimate of T̂(s, a, s')
- Discover each R̂(s, a, s') when we experience (s, a, s')
Step 2: Solve the learned MDP
- Use methods mentioned before (e.g., value iteration)
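As a rough illustration of these two steps, here is a minimal Python sketch (the function names learn_model / solve_learned_mdp, the (s, a, s', r) episode format, and the deterministic-reward assumption are mine, not from the slides): count outcomes, normalize the counts into T̂ and R̂, then run value iteration on the learned MDP.

```python
from collections import defaultdict

def learn_model(episodes):
    """Step 1: estimate T-hat and R-hat by counting observed (s, a, s', r) transitions."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = times observed
    R_hat = {}                                       # R_hat[(s, a, s')] = observed reward
    for episode in episodes:
        for (s, a, s2, r) in episode:
            counts[(s, a)][s2] += 1
            R_hat[(s, a, s2)] = r                    # assumes rewards are deterministic
    T_hat = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        T_hat[(s, a)] = {s2: n / total for s2, n in outcomes.items()}  # normalize counts
    return T_hat, R_hat

def solve_learned_mdp(T_hat, R_hat, gamma=0.9, iters=100):
    """Step 2: plain value iteration on the learned (empirical) MDP."""
    V = defaultdict(float)                           # unseen states default to value 0
    states = {s for (s, a) in T_hat}
    for _ in range(iters):
        new_V = defaultdict(float)
        for s in states:
            q_values = [
                sum(p * (R_hat[(s, a, s2)] + gamma * V[s2]) for s2, p in dist.items())
                for (s0, a), dist in T_hat.items() if s0 == s
            ]
            new_V[s] = max(q_values)
        V = new_V
    return V
```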
1.3 Model-free learning
1. Intuition: do not try to recover the MDP's transition model and reward function; only estimate the values used for decision making (e.g., Q-values and V-values).
2. Recall: for a fixed policy π, V^π(s) = Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ V^π(s')]; model-free methods estimate such quantities from samples instead of computing the expectation with T.
1.4 Passive reinforcement learning
Simplified task: policy evaluation (w.r.t. a fixed policy) with unknown T and R.
1.5 Active reinforcement learning
Goal: Learn optimal policy/values with unknown T and R.
Trade-off: exploration vs. exploitation
Common terms in RL context:
- Exploration: you have to try unknown actions to get information
- Exploitation: eventually, you have to use what you know
- Regret: even if you learn intelligently, you make mistakes
- Sampling: because of chance, you have to try things repeatedly
- Difficulty: learning can be much harder than solving a known MDP
2 Concrete learning algorithms
2.1 Direct evaluation (Passive)
Goal: Compute values for each state under π
Idea: Average together observed sample values
- Act according to π
- Every time you visit a state, write down what the sum of discounted rewards turned out to be (remark: this can only be filled in after the episode ends, since the sum of discounted rewards from a state depends on what happens after that state)
- Average those samples
Pros: It is easy to understand. It doesn't require any knowledge of T, R. It eventually computes the correct average values, using just sample transitions.
Cons: It wastes information about state connections. Each state must be learned separately. So, it takes a long time to learn.
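As a rough sketch of direct evaluation (the function name and the assumption that each episode is recorded as a list of (state, reward) pairs are mine): walk each episode backwards so the discounted return from every visit is known, then average the returns collected per state.

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=0.9):
    """Average the observed discounted returns per state, under the fixed policy pi."""
    returns = defaultdict(list)              # state -> list of sampled returns
    for episode in episodes:                 # episode: [(state, reward), ...] in time order
        G = 0.0
        # Walk backwards: the return-to-go G is only known once the rest of
        # the episode has been seen (this is the remark above).
        for (s, r) in reversed(episode):
            G = r + gamma * G
            returns[s].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```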
2.2 Temporal Difference Learning (Passive)
Intuition: based on a fixed policy, update the estimate of V(s) each time we experience a transition (s, a, s', r).
Formula:
- sample = R(s, π(s), s') + γ V^π(s')
- update: V^π(s) ← (1 − α) V^π(s) + α · sample, equivalently V^π(s) ← V^π(s) + α (sample − V^π(s))
Remark: decrease α after training for a long time, so the running average stops tracking noise and the estimates converge.
Limitation: without T and R, V-values are hard to turn into a new policy (next: learn Q-values instead of V-values, still model-free!).
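A minimal sketch of the TD update above (the function name td_update and the defaultdict value table are my choices): each experienced transition nudges V(s) toward the sample r + γ V(s') with step size α.

```python
from collections import defaultdict

def td_update(V, s, s_next, r, alpha=0.1, gamma=0.9):
    """One temporal-difference update of V(s) from a single transition (s, a, s', r)."""
    sample = r + gamma * V[s_next]               # sample = R(s, pi(s), s') + gamma * V(s')
    V[s] = (1 - alpha) * V[s] + alpha * sample   # running average toward the sample
    return V

# Usage: keep applying td_update to transitions generated by the fixed policy pi.
V = defaultdict(float)
for (s, a, s_next, r) in [("A", "right", "B", -1), ("B", "right", "exit", 10)]:
    V = td_update(V, s, s_next, r)
```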
2.3 Q-Learning (Passive)
1. Q-Learning with known T, R (Q-value iteration): Q_{k+1}(s, a) ← Σ_{s'} T(s, a, s') [R(s, a, s') + γ max_{a'} Q_k(s', a')]
2. Sample-based implementation (with unknown T,R):
- Receive a sample (s, a, s', r)
- At state s, take some action a w.r.t. some fixed policy. The environment returns the next state s' and the reward r.
- Old estimation: Q(s,a)
- sample = R(s, a, s') + γ max_{a'} Q(s', a')
- Incorporate the new estimate into a running average: Q(s, a) ← (1 − α) Q(s, a) + α · sample
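The same recipe as a small Python sketch (the dictionary layout Q[(s, a)] and the function name are assumptions of mine): each sample moves Q(s, a) toward r + γ max_{a'} Q(s', a') through a running average.

```python
from collections import defaultdict

def q_update(Q, s, a, s_next, r, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update from a single observed sample (s, a, s', r)."""
    # sample = R(s, a, s') + gamma * max_a' Q(s', a')
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    # Incorporate the new estimate into a running average with the old one.
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
    return Q

# Usage on one sample:
Q = defaultdict(float)
Q = q_update(Q, s="A", a="right", s_next="B", r=-1, actions=["left", "right"])
```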
2.4 Q-Learning (Active)
The learner makes its own choices with a policy that changes over time (act according to the current values/policy, but also explore).
Amazing result: Q-Learning converges to optimal policy -- even if you're acting suboptimally!
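A rough sketch of the active setting (the env.reset()/env.step() interface and the particular parameter values are assumptions of mine, not part of the notes): the learner picks its own actions from the current Q-values, explores with small probability ε (the ε-greedy rule of Section 3.1), and applies the same sample-based update as above.

```python
import random
from collections import defaultdict

def run_q_learning(env, actions, episodes=500, eps=0.1, alpha=0.1, gamma=0.9):
    """Active Q-learning: choose actions from the current Q (with exploration) and update."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()                              # assumed environment interface
        done = False
        while not done:
            # Exploration vs. exploitation (epsilon-greedy, see Section 3.1).
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a2: Q[(s, a2)])
            s_next, r, done = env.step(a)            # assumed environment interface
            sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
            s = s_next
    return Q
```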
2.5 Summary
3 Learning policy
3.1 ε-greedy
- With (small) probability ε, act randomly (explore)
- With (large) probability 1 − ε, act on the current policy (exploit)
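A one-function sketch of ε-greedy action selection (the name epsilon_greedy is mine): explore with probability ε, otherwise act greedily with respect to the current Q-values.

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """With probability eps act randomly (explore); otherwise follow the current Q (exploit)."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```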