Lect1_Intro_RL

Introduction to Reinforcement Learning

The RL Problem

State

  1. Environment state $S_t^e$

  2. Agent state $S_t^a$

  3. Information state (a.k.a. Markov state)
    Definition: a state $S_t$ is Markov if and only if
    $\mathbb{P}\left[ S_{t+1} \mid S_t \right] = \mathbb{P}\left[ S_{t+1} \mid S_1, \dots, S_t \right]$

Fully Observable Environments: $O_t = S_t^a = S_t^e$

Partially Observable Environments: $S_t^a \neq S_t^e$

Inside An RL Agent

  1. Policy: the agent's behaviour function, usually denoted $\pi$
  2. Value Function: evaluates how good or bad a state or action is
  3. Model: the agent's representation of the environment

Policy

A map from state to action.

  • Deterministic policy: $a = \pi(s)$
  • Stochastic policy: $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
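
As a minimal sketch (the toy state names, action names, and probabilities below are made up for illustration), the two kinds of policy can be represented like this:

```python
import random

# Deterministic policy: a = pi(s), a plain mapping from state to action.
deterministic_pi = {"s0": "left", "s1": "right"}

# Stochastic policy: pi(a|s) = P[A_t = a | S_t = s], one distribution per state.
stochastic_pi = {
    "s0": {"left": 0.9, "right": 0.1},
    "s1": {"left": 0.5, "right": 0.5},
}

def act(state, stochastic=False):
    """Pick an action in `state` under the chosen policy."""
    if not stochastic:
        return deterministic_pi[state]
    probs = stochastic_pi[state]
    return random.choices(list(probs), weights=list(probs.values()))[0]
```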

Value Function

A prediction of future reward, used to evaluate how good or bad a state is:
$v_\pi(s) = \mathbb{E}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s \right]$
Here $R_{t+1}$ is the reward received after taking an action in state $S_t$; note that some texts write this same reward as $R_t$.
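
As a rough sketch, $v_\pi(s)$ can be estimated by averaging sampled discounted returns from episodes that start in $s$ (a Monte Carlo style estimate; the reward sequences below are hypothetical):

```python
def discounted_return(rewards, gamma=0.9):
    """G = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def mc_value_estimate(reward_sequences, gamma=0.9):
    """Estimate v_pi(s) as the mean return over episodes started from s."""
    returns = [discounted_return(rs, gamma) for rs in reward_sequences]
    return sum(returns) / len(returns)

# Hypothetical reward sequences observed from some state s while following pi.
episodes_from_s = [[-1, -1, 10], [-1, 5, 0]]
print(mc_value_estimate(episodes_from_s))
```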

Model

  • A model predicts what the environment will do next
  • $\mathcal{P}$ predicts the next state
  • $\mathcal{R}$ predicts the next (immediate) reward, e.g.

$\mathcal{P}_{ss'}^a = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s, A_t = a \right]$
$\mathcal{R}_s^a = \mathbb{E}\left[R_{t+1} \mid S_t = s, A_t = a \right]$
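
A minimal sketch of a tabular model, where $\mathcal{P}$ and $\mathcal{R}$ are estimated from counts of observed transitions (the function and variable names here are my own, not from the lecture):

```python
from collections import defaultdict

transition_counts = defaultdict(lambda: defaultdict(int))  # N(s, a, s')
reward_sums = defaultdict(float)                           # total reward seen for (s, a)
visit_counts = defaultdict(int)                            # N(s, a)

def update_model(s, a, r, s_next):
    """Record one observed transition (s, a) -> (r, s_next)."""
    transition_counts[(s, a)][s_next] += 1
    reward_sums[(s, a)] += r
    visit_counts[(s, a)] += 1

def P(s, a, s_next):
    """Estimated P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a]."""
    n = visit_counts[(s, a)]
    return transition_counts[(s, a)][s_next] / n if n else 0.0

def R(s, a):
    """Estimated R^a_s = E[R_{t+1} | S_t = s, A_t = a]."""
    n = visit_counts[(s, a)]
    return reward_sums[(s, a)] / n if n else 0.0
```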

Problems within RL

Learning and Planning

Exploration and Exploitation

When a reasonably good solution already exists, should the agent keep exploring to gather more information about the environment, or exploit what it already knows to maximize reward? This is the trade-off between exploration and exploitation.
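
One common (though not the only) way to handle this trade-off is epsilon-greedy action selection, sketched below; the action-value estimates in the example are hypothetical:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """q_values: dict mapping action -> current value estimate for one state.

    With probability epsilon, explore (pick a random action);
    otherwise exploit (pick the action with the highest current estimate).
    """
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore
    return max(q_values, key=q_values.get)     # exploit

print(epsilon_greedy({"left": 1.2, "right": 0.7}, epsilon=0.1))
```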

Prediction and Control

  • Prediction: given a policy, evaluate how much reward the agent can expect to collect, i.e. estimate the future
  • Control: find which of the many possible policies collects the most reward, i.e. the optimal policy

In fact the two build on each other: in RL, the control problem is typically solved by way of solving prediction problems.

The following gridworld example illustrates the difference:

Prediction:

(Figure: gridworld prediction example; panel (b) shows the resulting value function.)

  • Except for the jumps from A to A' and from B to B', every step yields a reward of -1
  • The policy takes each of the four actions (up, down, left, right) with probability 25%

With these rules, the value function works out as shown in panel (b) of the figure above.
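
A sketch of how a prediction problem like this is solved numerically, using iterative policy evaluation; to keep it short, the MDP below is a made-up two-state example rather than the gridworld in the figure:

```python
def policy_evaluation(states, actions, P, R, pi, gamma=0.9, tol=1e-6):
    """Sweep v(s) <- sum_a pi(a|s) [R(s,a) + gamma * sum_s' P(s,a,s') v(s')] until convergence."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(
                pi[s][a] * (R[(s, a)] + gamma * sum(P[(s, a)][s2] * v[s2] for s2 in states))
                for a in actions
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

# Made-up two-state MDP with a uniform random policy, just to show the call.
states, actions = ["s0", "s1"], ["stay", "move"]
P = {("s0", "stay"): {"s0": 1.0, "s1": 0.0}, ("s0", "move"): {"s0": 0.0, "s1": 1.0},
     ("s1", "stay"): {"s0": 0.0, "s1": 1.0}, ("s1", "move"): {"s0": 1.0, "s1": 0.0}}
R = {("s0", "stay"): -1.0, ("s0", "move"): -1.0, ("s1", "stay"): 0.0, ("s1", "move"): -1.0}
pi = {s: {a: 0.5 for a in actions} for s in states}
print(policy_evaluation(states, actions, P, R, pi))
```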

Control:

(Figure: gridworld control example.)

The setting is the same as above, but now the policy is not given; we need to solve the control problem and find the optimal value function, from which the optimal policy follows naturally.
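
With the same kind of tabular model, the control problem can be sketched with value iteration: it computes the optimal value function, and the optimal policy is then read off greedily (again a generic sketch, not the exact computation behind the figure):

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    """Sweep v(s) <- max_a [R(s,a) + gamma * sum_s' P(s,a,s') v(s')] until convergence."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(
                R[(s, a)] + gamma * sum(P[(s, a)][s2] * v[s2] for s2 in states)
                for a in actions
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            break
    # The optimal policy is greedy with respect to the optimal value function.
    pi = {
        s: max(actions, key=lambda a, s=s: R[(s, a)] + gamma * sum(P[(s, a)][s2] * v[s2] for s2 in states))
        for s in states
    }
    return v, pi
```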
