Note: These notes organize the basic concepts of our school's CS181 course (the slides borrow from Berkeley CS188). Since lectures and exams are in English, English appears throughout.
1 Reinforcement Learning
1.1 Online setting
Def Online MDP: a Markov decision process whose transition and reward functions are unknown (only partially observed through the transitions the agent actually experiences).
Recall Markov decision process (MDP) elements:
- A set of states s ∈ S and a set of actions a ∈ A
- A transition model T(s, a, s')
- A reward function R(s, a, s')
- (Possibly) a start state and a discount factor γ
Implementation: reinforcement learning
1.2 Model-based learning
Step 1: Learn empirical MDP model
- Count outcomes s' for each s, a
- Normalize to give an estimate of T̂(s, a, s')
- Discover each R̂(s, a, s') when we experience (s, a, s')
Step 2: Solve the learned MDP
- Use methods mentioned before (e.g., value iteration)
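As a rough illustration of these two steps, here is a minimal Python sketch (the function names learn_model / solve_learned_mdp, the (s, a, s', r) episode format, and the deterministic-reward assumption are mine, not from the slides): count outcomes, normalize the counts into T̂ and R̂, then run value iteration on the learned MDP.

```python
from collections import defaultdict

def learn_model(episodes):
    """Step 1: estimate T-hat and R-hat by counting observed (s, a, s', r) transitions."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = times observed
    R_hat = {}                                       # R_hat[(s, a, s')] = observed reward
    for episode in episodes:
        for (s, a, s2, r) in episode:
            counts[(s, a)][s2] += 1
            R_hat[(s, a, s2)] = r                    # assumes rewards are deterministic
    T_hat = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        T_hat[(s, a)] = {s2: n / total for s2, n in outcomes.items()}  # normalize counts
    return T_hat, R_hat

def solve_learned_mdp(T_hat, R_hat, gamma=0.9, iters=100):
    """Step 2: plain value iteration on the learned (empirical) MDP."""
    V = defaultdict(float)                           # unseen states default to value 0
    states = {s for (s, a) in T_hat}
    for _ in range(iters):
        new_V = defaultdict(float)
        for s in states:
            q_values = [
                sum(p * (R_hat[(s, a, s2)] + gamma * V[s2]) for s2, p in dist.items())
                for (s0, a), dist in T_hat.items() if s0 == s
            ]
            new_V[s] = max(q_values)
        V = new_V
    return V
```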
1.3 Model-free learning
1. Intuition: do not try to recover the MDP's transition model and reward function; only estimate the values used for decision making (e.g., Q-values and V-values).
2. Recall: for a fixed policy π, V^π(s) = Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ V^π(s')]; model-free methods estimate such quantities from samples instead of computing the expectation with T.
1.4 Passive reinforcement learning
Simplified task: policy evaluation (w.r.t. a fixed policy) with unknown T and R.
1.5 Active reinforcement learning
Goal: Learn optimal policy/values with unknown T and R.
Trade-off: exploration vs. exploitation
Common terms in RL context:
- Exploration: you have to try unknown actions to get information
- Exploitation: eventually, you have to use what you know
- Regret: even if you learn intelligently, you make mistakes
- Sampling: because of chance, you have to try things repeatedly
- Difficulty: learning can be much harder than solving a known MDP
2 Concrete learning algorithms
2.1 Direct evaluation (Passive)
Goal: Compute values for each state under π
Idea: Average together observed sample values
- Act according to π
- Every time you visit a state, write down what the sum of discounted rewards turned out to be (remark: this can only be filled in after the episode ends, since the sum of discounted rewards from a state depends on what happens after that state)
- Average those samples
Pros: It is easy to understand. It doesn't require any knowledge of T, R. It eventually computes the correct average values, using just sample transitions.
Cons: It wastes information about state connections. Each state must be learned separately. So, it takes a long time to learn.
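As a rough sketch of direct evaluation (the function name and the assumption that each episode is recorded as a list of (state, reward) pairs are mine): walk each episode backwards so the discounted return from every visit is known, then average the returns collected per state.

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=0.9):
    """Average the observed discounted returns per state, under the fixed policy pi."""
    returns = defaultdict(list)              # state -> list of sampled returns
    for episode in episodes:                 # episode: [(state, reward), ...] in time order
        G = 0.0
        # Walk backwards: the return-to-go G is only known once the rest of
        # the episode has been seen (this is the remark above).
        for (s, r) in reversed(episode):
            G = r + gamma * G
            returns[s].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```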
2.2 Temporal Difference Learning (Passive)
Intuition: based on a fixed policy, update the estimate of V(s) each time we experience a transition (s, a, s', r).
Formula:
- sample = R(s, π(s), s') + γ V^π(s')
- update: V^π(s) ← (1 − α) V^π(s) + α · sample, equivalently V^π(s) ← V^π(s) + α (sample − V^π(s))
Remark: decrease α after training for a long time, so the running average stops tracking noise and the estimates converge.
Limitation: without T and R, V-values are hard to turn into a new policy (next: learn Q-values instead of V-values, still model-free!).
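A minimal sketch of the TD update above (the function name td_update and the defaultdict value table are my choices): each experienced transition nudges V(s) toward the sample r + γ V(s') with step size α.

```python
from collections import defaultdict

def td_update(V, s, s_next, r, alpha=0.1, gamma=0.9):
    """One temporal-difference update of V(s) from a single transition (s, a, s', r)."""
    sample = r + gamma * V[s_next]               # sample = R(s, pi(s), s') + gamma * V(s')
    V[s] = (1 - alpha) * V[s] + alpha * sample   # running average toward the sample
    return V

# Usage: keep applying td_update to transitions generated by the fixed policy pi.
V = defaultdict(float)
for (s, a, s_next, r) in [("A", "right", "B", -1), ("B", "right", "exit", 10)]:
    V = td_update(V, s, s_next, r)
```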
2.3 Q-Learning (Passive)
1. Q-Learning with known T, R (Q-value iteration): Q_{k+1}(s, a) ← Σ_{s'} T(s, a, s') [R(s, a, s') + γ max_{a'} Q_k(s', a')]
2. Sample-based implementation (with unknown T,R):
- Receive a sample (s, a, s', r)
- At state s, take some action a w.r.t. some fixed policy. The environment returns the next state s' and the reward r.
- Old estimation: Q(s,a)
- sample = R(s, a, s') + γ max_{a'} Q(s', a')
- Incorporate the new estimate into a running average: Q(s, a) ← (1 − α) Q(s, a) + α · sample
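The same recipe as a small Python sketch (the dictionary layout Q[(s, a)] and the function name are assumptions of mine): each sample moves Q(s, a) toward r + γ max_{a'} Q(s', a') through a running average.

```python
from collections import defaultdict

def q_update(Q, s, a, s_next, r, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update from a single observed sample (s, a, s', r)."""
    # sample = R(s, a, s') + gamma * max_a' Q(s', a')
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    # Incorporate the new estimate into a running average with the old one.
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
    return Q

# Usage on one sample:
Q = defaultdict(float)
Q = q_update(Q, s="A", a="right", s_next="B", r=-1, actions=["left", "right"])
```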
2.4 Q-Learning (Active)
The learner makes its own choices with a policy that changes over time (act according to the current values/policy, but also explore).
Amazing result: Q-Learning converges to optimal policy -- even if you're acting suboptimally!
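A rough sketch of the active setting (the env.reset()/env.step() interface and the particular parameter values are assumptions of mine, not part of the notes): the learner picks its own actions from the current Q-values, explores with small probability ε (the ε-greedy rule of Section 3.1), and applies the same sample-based update as above.

```python
import random
from collections import defaultdict

def run_q_learning(env, actions, episodes=500, eps=0.1, alpha=0.1, gamma=0.9):
    """Active Q-learning: choose actions from the current Q (with exploration) and update."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()                              # assumed environment interface
        done = False
        while not done:
            # Exploration vs. exploitation (epsilon-greedy, see Section 3.1).
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a2: Q[(s, a2)])
            s_next, r, done = env.step(a)            # assumed environment interface
            sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
            s = s_next
    return Q
```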
2.5 Summary
3 Learning policy
3.1 ε-greedy
- With (small) probability ε, act randomly (explore)
- With (large) probability 1 − ε, act on the current policy (exploit)
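A one-function sketch of ε-greedy action selection (the name epsilon_greedy is mine): explore with probability ε, otherwise act greedily with respect to the current Q-values.

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """With probability eps act randomly (explore); otherwise follow the current Q (exploit)."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```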