Reinforcement Learning (RL) - Lecture Note for CS188 (and CS181 at ShanghaiTech)

Note: these notes organize the basic concepts of our CS181 course (the slides are borrowed from Berkeley CS188). Since lectures and exams are in English, English appears throughout.

1 Reinforcement Learning

1.1 Online setting

Def (Online MDP): a Markov decision process whose transition and reward functions are unknown to the agent; it only learns about them through the transitions it actually experiences while acting (online).

Recall the Markov decision process (MDP) elements: a set of states S, a set of actions A, a transition model T(s, a, s'), a reward function R(s, a, s'), a start state (and possibly a terminal state), and a discount factor \gamma.

Implementation: reinforcement learning

1.2 Model-based learning

Step 1: Learn an empirical MDP model

  • Count outcomes s' for each s, a
  • Normalize to give an estimate of \hat{T}(s, a, s')
  • Discover each \hat{R}(s, a, s') when we experience (s, a, s')

Step 2: Solve the learned MDP (e.g., with value iteration), as in the sketch below.
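
A minimal sketch of the two steps, assuming the observed experience is already collected as a list of (s, a, s', r) tuples; the dictionary-based model representation and the `states`/`actions` arguments are illustrative choices, not from the lecture:

```python
from collections import defaultdict

def estimate_model(transitions):
    """Step 1: count outcomes and normalize into empirical T-hat and R-hat."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s_next] = #occurrences
    rewards = {}                                     # observed reward for (s, a, s_next)
    for s, a, s_next, r in transitions:
        counts[(s, a)][s_next] += 1
        rewards[(s, a, s_next)] = r
    T_hat = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        T_hat[(s, a)] = {s_next: n / total for s_next, n in outcomes.items()}
    return T_hat, rewards

def value_iteration(T_hat, R_hat, states, actions, gamma=0.9, iterations=100):
    """Step 2: solve the learned MDP with standard value iteration."""
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        V = {
            s: max(
                (sum(p * (R_hat[(s, a, s2)] + gamma * V[s2])
                     for s2, p in T_hat[(s, a)].items())
                 for a in actions if (s, a) in T_hat),
                default=0.0,   # states with no observed actions keep value 0
            )
            for s in states
        }
    return V
```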

1.3 Model-free learning

1. Intuition: do not try to recover the MDP's transition model and reward function; instead, directly estimate the quantities used for decision making (e.g., Q-values and V-values).

2. Recall: S

1.4 Passive reinforcement learning

Simplified task: policy evaluation (w.r.t. a fixed policy π) with unknown T and R.

1.5 Active reinforcement learning

Goal: Learn optimal policy/values with unknown T and R.

Trade-off: exploration vs. exploitation

Common terms in RL context:

  1. Exploration: you have to try unknown actions to get information
  2. Exploitation: eventually, you have to use what you know
  3. Regret: even if you learn intelligently, you make mistakes; regret measures the total cost of those mistakes compared to always acting optimally
  4. Sampling: because of chance, you have to try things repeatedly
  5. Difficulty: learning can be much harder than solving a known MDP

2 Learning Algorithms

2.1 Direct evaluation (Passive)

Goal: Compute values for each state under π

Idea: Average together observed sample values

  • Act according to π
  • Every time you visit a state, write down what the sum of discounted rewards turned out to be (remark: this can only be filled in after the episode ends, since the discounted return from a state depends on the future rewards)
  • Average those samples

Pros: It is easy to understand. It doesn't require any knowledge of T, R. It eventually computes the correct average values, using just sample transitions.

Cons: It wastes information about the connections between states; each state must be learned separately, so it takes a long time to learn.
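
A minimal sketch of direct evaluation, assuming each episode is given as a list of (state, reward) pairs in visiting order, generated by acting under π (this episode format is an assumption for illustration):

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=0.9):
    """Average the observed discounted returns per state under a fixed policy pi."""
    totals = defaultdict(float)   # sum of sampled returns for each state
    visits = defaultdict(int)     # number of samples for each state
    for episode in episodes:
        # Walk backwards so the return from each visit is computed incrementally.
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G
            totals[state] += G
            visits[state] += 1
    return {s: totals[s] / visits[s] for s in totals}
```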

2.2 Temporal Difference Learning (Passive)

Intuition: given a fixed policy, update the estimate of V^{\pi}(s) each time we experience a transition (s, a, s', r).

Formula:

  • \text{sample} = R(s, \pi(s), s') + \gamma V^{\pi}(s')
  • update: V^{\pi}(s) \gets (1 - \alpha)\, V^{\pi}(s) + \alpha \cdot \text{sample}

Remark: decrease the learning rate \alpha as training goes on, so that the running average settles down instead of forever chasing the most recent samples.
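
A minimal sketch of this TD(0) update for a single observed transition (the dictionary-based value table is an illustrative choice, not from the lecture):

```python
from collections import defaultdict

V = defaultdict(float)  # V^pi(s), 0.0 for states not yet seen

def td_update(s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD update after experiencing (s, pi(s), s_next, r)."""
    sample = r + gamma * V[s_next]                  # sample = R(s, pi(s), s') + gamma * V(s')
    V[s] = (1 - alpha) * V[s] + alpha * sample      # exponential running average

# Example: following pi, the agent moved from 'A' to 'B' and received reward 2.
td_update('A', 2.0, 'B')
```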

Limitation: V-values alone are hard to turn into a new policy, because the one-step look-ahead (argmax over actions) needs T and R. (Solution, next section: learn Q-values instead of V-values, which keeps action selection model-free!)

2.3 Q-Learning (Passive)

1. Q-Learning with known T, R (Q-value iteration): Q_{k+1}(s, a) \gets \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right]

2. Sample-based implementation (with unknown T,R):

  • Receive a sample (s, a, s', r)
    • At state s, take some action a according to some fixed policy. The environment returns the resulting transition: the next state s' and the reward r.
  • Old estimation: Q(s,a)
  • \text{sample} = R(s, a, s') + \gamma \max_{a'} Q(s', a')
  • Incorporate the new estimate into a running average:
    • Q(s,a) \gets (1-\alpha)\, Q(s,a) + \alpha \cdot \text{sample}
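
The same running-average update written out as code; the tabular `Q` dictionary and the `actions` callable (returning the legal actions of a state) are assumed helpers for illustration:

```python
from collections import defaultdict

Q = defaultdict(float)  # Q[(s, a)], 0.0 for unseen state-action pairs

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Incorporate one sample (s, a, s', r) into the running average of Q(s, a)."""
    best_next = max((Q[(s_next, a2)] for a2 in actions(s_next)), default=0.0)
    sample = r + gamma * best_next                        # r + gamma * max_a' Q(s', a')
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```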

2.4 Q-Learning (Active)

The learner now makes its own choices with a changing policy (exploiting the current values/policy while also exploring).

Amazing result: Q-Learning converges to the optimal policy -- even if you're acting suboptimally! (Caveats: you have to explore enough, and you have to eventually make the learning rate small enough, but not decrease it too quickly.)
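
A sketch of the full active-learning loop. The environment interface here is an assumption chosen for this sketch: `env.reset()` returns a start state and `env.step(a)` returns `(next_state, reward, done)`. Exploration uses ε-greedy, described in Section 3.1.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular active Q-learning with epsilon-greedy exploration."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Explore with probability epsilon, otherwise exploit current Q-values.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
            s = s_next
    return Q
```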

2.5 Summary

3 Learning policy

3.1 ε-greedy

  • With (small) probability ε, act randomly

  • With (large) probability 1 - ε, act according to the current policy (greedily w.r.t. the current Q-values)
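
A minimal sketch of ε-greedy action selection over a tabular Q (the `Q` dictionary and `actions` list are assumed):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon act randomly; otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```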

3.2 Exploration functions

TBD: Next lecture on 2018/11/30
