Intro
- $r$: reward function, can be a function of $r(s)$ or $r(s,a)$
- $s$: state
- $a$: action
- $h$: history
- $\gamma$: discount factor
- mathematically convenient. $\gamma = 0$: only care about immediate reward
- $\pi^*$: optimal policy
- $V^{\pi}_k(s)$: state value function at step $k$ under policy $\pi$, starting in state $s$
- $Q^\pi(s,a)$: state-action value. Take action $a$, then follow the policy $\pi$
- model: how the world changes given $s_t$ and $a_t$; can be stochastic or deterministic
- model-free: you don't know how the world changes, e.g. a two-player game
- policy: how you act given a state $s$
- value: expected discounted sum of future rewards from a state under a policy
- exploration: try new things that might be better in the future
- exploitation: choose actions that are expected to yield good reward given past experience
- Horizon: the number of actions you can take to reach termination, could be finite or infinite
- Episode: series of actions from start to end
- $G_t$: discounted sum of rewards, $G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$
- $V(s)$: state value function, the expected return from starting in state $s$: $V(s) = \mathbb{E}[G_t \mid s_t = s]$
- $O$: contraction operator. $|OV - OV'| \leq |V - V'|$
Lecture 2
Markov (Decision / reward) process
- Bandit: a single-state MDP; you only care about the immediate reward
State $s_t$ is Markov iff $p(s_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid h_t, a_t)$
Markov Chain: each state has a fixed distribution over the next state
Markov Reward process
- Markov chain + reward
- In an $n$-step episode, the MRP value function satisfies
$V(s) = \underbrace{R(s)}_{\text{Immediate reward}} + \underbrace{\gamma \sum_{s' \in S} p(s' \mid s) V(s')}_{\text{Discounted sum of future rewards}}$
Iterative algorithm for computing the value of an MRP
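A minimal sketch of this iterative computation, assuming the MRP is given as a NumPy transition matrix `P` (rows sum to 1) and a reward vector `R`; these names are illustrative, not from the lecture:

```python
import numpy as np

def mrp_value(P, R, gamma, tol=1e-8):
    """Iteratively apply V <- R + gamma * P V until the values stop changing."""
    V = np.zeros(len(R))
    while True:
        V_new = R + gamma * P @ V              # Bellman equation for an MRP
        if np.max(np.abs(V_new - V)) < tol:    # converges because the backup is a contraction for gamma < 1
            return V_new
        V = V_new
```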
Markov Decision Process
MRP + actions
P is the transition model for each action, so given a state and an action,
$P(s_{t+1} = s' \mid s_t = s, a_t = a)$
The next state is usually not deterministic given a state and an action.
Quiz
Suppose you have 7 discrete states and 2 actions, how many deterministic policies are there?
2^7
Is the optimal policy for an MDP always unique?
no
Policy Search
$A$: number of actions, $S$: number of states; try all $A^S$ possibilities
Policy Iteration
- Set $i = 0$
- Initialize $\pi_0(s)$ randomly for all states $s$
- While $i == 0$ or $\|\pi_i - \pi_{i-1}\|_1 > 0$ (L1-norm, measures whether the policy changed for any state):
  - $V^{\pi_i} \leftarrow$ MDP value function policy evaluation of $\pi_i$
  - $\pi_{i+1} \leftarrow$ policy improvement
  - $i = i + 1$
Q value
State-action value of a policy
$Q^\pi(s,a) = R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^\pi(s')$
Policy Improvement
Compute the new policy $\pi_{i+1}$, for all $s \in S$:
$\pi_{i+1}(s) = \underset{a}{\arg\max}~ Q^{\pi_i}(s,a) \quad \forall s \in S$
note: so this updates the policy for all states, not just one state?
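Putting the evaluation step, the Q computation, and the greedy improvement together, here is a rough tabular policy-iteration sketch. The per-action transition tensor `P[a]` and reward table `R[s, a]` are hypothetical names for illustration, not from the notes:

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma, tol=1e-8):
    """Iterative evaluation of V^pi for a deterministic policy (array of action indices)."""
    n_states = R.shape[0]
    V = np.zeros(n_states)
    while True:
        # V(s) = R(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) V(s')
        V_new = np.array([R[s, policy[s]] + gamma * P[policy[s]][s] @ V
                          for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_iteration(P, R, gamma):
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)      # arbitrary initial policy
    while True:
        V = policy_evaluation(P, R, policy, gamma)
        # Q^pi(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V^pi(s')
        Q = np.array([[R[s, a] + gamma * P[a][s] @ V for a in range(n_actions)]
                      for s in range(n_states)])
        new_policy = Q.argmax(axis=1)           # greedy improvement for every state
        if np.array_equal(new_policy, policy):  # policy unchanged -> stop
            return policy, V
        policy = new_policy
```

Note that the improvement step is indeed applied to every state at once, which is what the argmax over all $s \in S$ in the formula above expresses.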
Policy Iteration quiz
Bellman backup operator
$BV(s) = \max_a \left[ R(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a) V(s') \right]$
- $BV$ yields a value function over all states $s$
Value Iteration
Bellman Backup is a contraction operator
For $V(s)$ and $V'(s)$: if both currently take the optimal action $a$, then the gap between them naturally shrinks.
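A corresponding value-iteration sketch, repeatedly applying the Bellman backup $B$ above (same hypothetical `P[a]`, `R[s, a]` layout as before):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Repeatedly apply BV(s) = max_a [R(s,a) + gamma * sum_s' p(s'|s,a) V(s')]."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        Q = np.array([[R[s, a] + gamma * P[a][s] @ V for a in range(n_actions)]
                      for s in range(n_states)])
        V_new = Q.max(axis=1)                   # Bellman backup over all states
        if np.max(np.abs(V_new - V)) < tol:     # the backup is a gamma-contraction
            return V_new, Q.argmax(axis=1)      # values and the greedy policy
        V = V_new
```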
Policy Evaluation with DP
$V^\pi(s) \approx \mathbb{E}_\pi [r_t + \gamma V_{k-1}(s_{t+1}) \mid s_t = s]$
Lecture 3
Monte Carlo policy evaluation
No model
- the transition from $(s,a)$ need not be deterministic
- doesn't assume the state is Markov
- requires episodes to terminate
- If trajectories are all finite, sample set of trajectories & average returns
First-visit Monte Carlo vs every-visit Monte Carlo policy evaluation
first-visit: only count the return from the first time you reach the state in an episode
every-visit: account for every visit to the state; a biased estimator
This is how we estimate $V(s)$.
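A first-visit Monte Carlo sketch, assuming episodes have already been collected as lists of `(state, reward)` pairs (this input format is made up for illustration):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma):
    """Estimate V(s) by averaging returns from the first visit to s in each episode."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:                  # episode: list of (state, reward) pairs
        # Compute returns backwards: G_t = r_t + gamma * G_{t+1}
        returns = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        # Record the return only at the first visit to each state
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns_sum[s] += returns[t]
                returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

Every-visit MC would drop the `seen` check and average over every occurrence of the state, which is the biased variant mentioned above.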
Temporal Difference Learning for Estimating V (TD learning)
Update immediately rather than waiting until the end of the episode.
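A TD(0) sketch of this incremental update, $V(s_t) \leftarrow V(s_t) + \alpha (r_t + \gamma V(s_{t+1}) - V(s_t))$, applied after every step; the function below assumes a tabular value estimate stored in a dict or array `V`:

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """One TD(0) step: move V(s) toward the bootstrapped target r + gamma * V(s_next)."""
    td_target = r + gamma * V[s_next]
    V[s] = V[s] + alpha * (td_target - V[s])
    return V
```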
Lecture 4
Model-free control
- Model is unknown but can be sampled
- Model is known but computationally infeasible to use directly
On/Off policy
- On policy
- Learn from following that policy
- Off policy
- learn from following a different policy
MC for on-policy Q
SARSA algorithm (TD)
- Set an initial $\epsilon$-greedy policy $\pi$ randomly, $t = 0$, initial state $s_t = s_0$
- Take $a_t \sim \pi(s_t)$ // sample from the policy
- Observe $(r_t, s_{t+1})$
- loop
  - Take action $a_{t+1} \sim \pi(s_{t+1})$
  - Observe $(r_{t+1}, s_{t+2})$
  - Update Q given $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$:
    - $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha (r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))$
  - $\pi(s_t) = \underset{a}{\arg\max}\, Q(s_t, a)$ with probability $1 - \epsilon$, else random
  - $t = t + 1$
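A sketch of the SARSA loop above, assuming a tabular environment exposing `env.reset()` and `env.step(a)` returning `(next_state, reward, done)`; this interface is hypothetical, not from the notes:

```python
import numpy as np

def sarsa(env, n_states, n_actions, gamma, alpha=0.1, epsilon=0.1, n_episodes=1000):
    Q = np.zeros((n_states, n_actions))

    def eps_greedy(s):
        # With probability 1 - epsilon pick argmax_a Q(s, a), else a random action.
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(Q[s].argmax())

    for _ in range(n_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            # SARSA target uses the action the current policy actually takes next.
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] * (not done) - Q[s, a])
            s, a = s_next, a_next
    return Q
```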
Q learning
Useful if you want to take bad actions in the early stages and gain more later.
GLIE (Greedy in the Limit with Infinite Exploration)
You can't always satisfy GLIE, for example in the helicopter case: if you break the helicopter, you can't go back and make another decision.
Q-learning
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha (r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t))$
Note: here $a'$ is the best over all actions, whereas SARSA uses the action chosen by the current policy.
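For comparison, a Q-learning sketch under the same hypothetical `env` interface as the SARSA sketch; the only change is that the target bootstraps from $\max_{a'} Q(s_{t+1}, a')$ rather than from the action the behavior policy takes next:

```python
import numpy as np

def q_learning(env, n_states, n_actions, gamma, alpha=0.1, epsilon=0.1, n_episodes=1000):
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy (off-policy: the update target ignores it)
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # Q-learning target: bootstrap from the best action in the next state.
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() * (not done) - Q[s, a])
            s = s_next
    return Q
```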