Lect1_Intro_RL

Introduction to Reinforcement Learning

The RL Problem

State

  1. Environment state $S_t^e$

  2. Agent state $S_t^a$

  3. Information state (a.k.a. Markov state)
    Definition: a state $S_t$ is Markov if and only if
    $\mathbb{P}\left[ S_{t+1} \mid S_t \right] = \mathbb{P}\left[ S_{t+1} \mid S_1, \dots, S_t \right]$

Fully Observable Environments: $O_t = S_t^a = S_t^e$

Partially Observable Environments: $S_t^a \neq S_t^e$

Inside An RL Agent

  1. Policy: the agent's behaviour function, usually denoted $\pi$
  2. Value Function: evaluates how good or bad a state or action is
  3. Model: the agent's representation of the environment

Policy

A map from state to action.

  • Deterministic policy: $a = \pi(s)$
  • Stochastic policy: $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
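
As a minimal sketch (the toy state names, action names, and probabilities below are made up for illustration), the two kinds of policy can be represented like this:

```python
import random

# Deterministic policy: a = pi(s), a plain mapping from state to action.
deterministic_pi = {"s0": "left", "s1": "right"}

# Stochastic policy: pi(a|s) = P[A_t = a | S_t = s], one distribution per state.
stochastic_pi = {
    "s0": {"left": 0.9, "right": 0.1},
    "s1": {"left": 0.5, "right": 0.5},
}

def act(state, stochastic=False):
    """Pick an action in `state` under the chosen policy."""
    if not stochastic:
        return deterministic_pi[state]
    probs = stochastic_pi[state]
    return random.choices(list(probs), weights=list(probs.values()))[0]
```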

Value Function

A prediction of future reward, used to evaluate how good or bad a state is:
$v_\pi(s) = \mathbb{E}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s \right]$
Here $R_{t+1}$ is the reward received after taking an action in state $S_t$; note that some texts write this same reward as $R_t$.
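
As a rough sketch, $v_\pi(s)$ can be estimated by averaging sampled discounted returns from episodes that start in $s$ (a Monte Carlo style estimate; the reward sequences below are hypothetical):

```python
def discounted_return(rewards, gamma=0.9):
    """G = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def mc_value_estimate(reward_sequences, gamma=0.9):
    """Estimate v_pi(s) as the mean return over episodes started from s."""
    returns = [discounted_return(rs, gamma) for rs in reward_sequences]
    return sum(returns) / len(returns)

# Hypothetical reward sequences observed from some state s while following pi.
episodes_from_s = [[-1, -1, 10], [-1, 5, 0]]
print(mc_value_estimate(episodes_from_s))
```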

Model

  • A model predicts what the environment will do next
  • $\mathcal{P}$ predicts the next state
  • $\mathcal{R}$ predicts the next (immediate) reward, e.g.

$\mathcal{P}_{ss'}^a = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s, A_t = a \right]$
$\mathcal{R}_s^a = \mathbb{E}\left[R_{t+1} \mid S_t = s, A_t = a \right]$
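
A minimal sketch of a tabular model, where $\mathcal{P}$ and $\mathcal{R}$ are estimated from counts of observed transitions (the function and variable names here are my own, not from the lecture):

```python
from collections import defaultdict

transition_counts = defaultdict(lambda: defaultdict(int))  # N(s, a, s')
reward_sums = defaultdict(float)                           # total reward seen for (s, a)
visit_counts = defaultdict(int)                            # N(s, a)

def update_model(s, a, r, s_next):
    """Record one observed transition (s, a) -> (r, s_next)."""
    transition_counts[(s, a)][s_next] += 1
    reward_sums[(s, a)] += r
    visit_counts[(s, a)] += 1

def P(s, a, s_next):
    """Estimated P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a]."""
    n = visit_counts[(s, a)]
    return transition_counts[(s, a)][s_next] / n if n else 0.0

def R(s, a):
    """Estimated R^a_s = E[R_{t+1} | S_t = s, A_t = a]."""
    n = visit_counts[(s, a)]
    return reward_sums[(s, a)] / n if n else 0.0
```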

Problems within RL

Learning and Planning

Exploration and Exploitation

When a reasonably good solution already exists, should the agent keep exploring to gather more information about the environment, or exploit what it already knows to maximize reward? This is the trade-off between exploration and exploitation.
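
One common (though not the only) way to handle this trade-off is epsilon-greedy action selection, sketched below; the action-value estimates in the example are hypothetical:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """q_values: dict mapping action -> current value estimate for one state.

    With probability epsilon, explore (pick a random action);
    otherwise exploit (pick the action with the highest current estimate).
    """
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore
    return max(q_values, key=q_values.get)     # exploit

print(epsilon_greedy({"left": 1.2, "right": 0.7}, epsilon=0.1))
```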

Prediction and Control

  • Prediction: given a policy, evaluate how much reward the agent can expect to collect, i.e. estimate the future
  • Control: find which of the many possible policies collects the most reward, i.e. the optimal policy

In fact the two build on each other: in RL, the control problem is typically solved by way of solving prediction problems.

The following gridworld example illustrates the difference:

Prediction:

(Figure: gridworld prediction example; panel (b) shows the resulting value function.)

  • Except for the jumps from A to A' and from B to B', every step yields a reward of -1
  • The policy takes each of the four actions (up, down, left, right) with probability 25%

With these rules, the value function works out as shown in panel (b) of the figure above.
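
A sketch of how a prediction problem like this is solved numerically, using iterative policy evaluation; to keep it short, the MDP below is a made-up two-state example rather than the gridworld in the figure:

```python
def policy_evaluation(states, actions, P, R, pi, gamma=0.9, tol=1e-6):
    """Sweep v(s) <- sum_a pi(a|s) [R(s,a) + gamma * sum_s' P(s,a,s') v(s')] until convergence."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(
                pi[s][a] * (R[(s, a)] + gamma * sum(P[(s, a)][s2] * v[s2] for s2 in states))
                for a in actions
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

# Made-up two-state MDP with a uniform random policy, just to show the call.
states, actions = ["s0", "s1"], ["stay", "move"]
P = {("s0", "stay"): {"s0": 1.0, "s1": 0.0}, ("s0", "move"): {"s0": 0.0, "s1": 1.0},
     ("s1", "stay"): {"s0": 0.0, "s1": 1.0}, ("s1", "move"): {"s0": 1.0, "s1": 0.0}}
R = {("s0", "stay"): -1.0, ("s0", "move"): -1.0, ("s1", "stay"): 0.0, ("s1", "move"): -1.0}
pi = {s: {a: 0.5 for a in actions} for s in states}
print(policy_evaluation(states, actions, P, R, pi))
```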

Control:

(Figure: gridworld control example.)

The setting is the same as above, but now the policy is not given; we need to solve the control problem and find the optimal value function, from which the optimal policy follows naturally.
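
With the same kind of tabular model, the control problem can be sketched with value iteration: it computes the optimal value function, and the optimal policy is then read off greedily (again a generic sketch, not the exact computation behind the figure):

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    """Sweep v(s) <- max_a [R(s,a) + gamma * sum_s' P(s,a,s') v(s')] until convergence."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(
                R[(s, a)] + gamma * sum(P[(s, a)][s2] * v[s2] for s2 in states)
                for a in actions
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            break
    # The optimal policy is greedy with respect to the optimal value function.
    pi = {
        s: max(actions, key=lambda a, s=s: R[(s, a)] + gamma * sum(P[(s, a)][s2] * v[s2] for s2 in states))
        for s in states
    }
    return v, pi
```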
