Reading Notes on Reinforcement Learning: An Introduction, Chapter 3

Chapter 3: The Reinforcement Learning Problem

3.2 Goals and Rewards

The goal of the agent in RL:

To maximize not immediate reward, but cumulative reward in the long run.

Reward:

The use of a reward signal to formalize the idea of a goal is one of the most distinctive features of reinforcement learning.
The reward signal is your way of communicating to the robot what you want it to achieve, not how you want it achieved.

3.3 Returns

If the sequence of rewards received after time step $t$ is denoted $R_{t+1}, R_{t+2}, R_{t+3}, \ldots$, we seek to maximize the expected return, where the return, $G_t$, is defined as some specific function of the reward sequence. In the simplest case the return is the sum of the rewards:

$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T$,

where $T$ is a final time step.

  • Episodes: subsequences into which the agent–environment interaction breaks naturally (such as plays of a game, trips through a maze, or any sort of repeated interaction).
  • Each episode ends in a special state called the terminal state.
  • Tasks with episodes are called episodic tasks.
  • In episodic tasks we sometimes need to distinguish the set of all nonterminal states, denoted S, from the set of all states plus the terminal state, denoted S+.

Discounted return:

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$,

where $\gamma$, $0 \le \gamma \le 1$, is the discount rate.

Continuing tasks: the agent–environment interaction does not break naturally into identifiable episodes, but goes on continually without limit.
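The two forms of return are easy to check numerically. Below is a minimal Python sketch (the helper name `discounted_return` and the reward numbers are my own, not from the book): with $\gamma = 1$ it reduces to the plain episodic sum, and with $\gamma < 1$ it gives the discounted return.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * rewards[k], where rewards[k] plays the role of R_{t+k+1}."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Rewards received after some time step t in a short episode (made-up numbers).
rewards = [1.0, 0.0, 2.0, 3.0]
print(discounted_return(rewards, gamma=1.0))  # undiscounted episodic return: 6.0
print(discounted_return(rewards, gamma=0.9))  # 1 + 0.9*0 + 0.81*2 + 0.729*3 = 4.807
```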

3.4 Unified Notation for Episodic and Continuing Tasks

$S_{t,i}$ is the state representation at time $t$ of episode $i$ (and similarly for $A_{t,i}$, $R_{t,i}$, $\pi_{t,i}$, $T_i$, etc.). However, it turns out that when we discuss episodic tasks we almost never have to distinguish between different episodes, so we write $S_t$ to refer to $S_{t,i}$, and so on.

Episode termination can be regarded as entering a special absorbing state that transitions only to itself and that generates only rewards of zero.
The unified return can then be written as

$G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$,

including the possibility that $T = \infty$ or $\gamma = 1$ (but not both).
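To see why the absorbing-state convention makes the episodic and continuing cases interchangeable, here is a small check (my own example, using the same kind of helper as above): appending an absorbing state that emits only zero reward leaves $G_t$ unchanged.

```python
def discounted_return(rewards, gamma):
    # G_t = sum_k gamma^k * R_{t+k+1}
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

episode = [1.0, 1.0, 1.0]           # rewards of a finite episode
padded = episode + [0.0] * 100      # after termination: absorbing state, reward 0 forever

gamma = 0.9
assert discounted_return(episode, gamma) == discounted_return(padded, gamma)
print(discounted_return(episode, gamma))  # 1 + 0.9 + 0.81 ≈ 2.71
```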

3.6 Markov Decision Processes

A reinforcement learning task that satisfies the Markov property is called a Markov decision process, or MDP. If the state and action spaces are finite, then it is called a finite Markov decision process (finite MDP).

Transition probabilities: given any state and action, $s$ and $a$, the probability of each possible next state, $s'$, is

$p(s' \mid s, a) = \Pr\{S_{t+1} = s' \mid S_t = s, A_t = a\}.$

The expected value of the next reward is

$r(s, a, s') = \mathbb{E}\left[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'\right].$
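As a concrete illustration (the two-state MDP below, the dictionaries `P` and `R`, and the state/action names are my own invention, not from the book), the one-step dynamics of a finite MDP can be stored as lookup tables for $p(s' \mid s, a)$ and $r(s, a, s')$:

```python
# One-step dynamics of a toy finite MDP with two states and two actions
# (made-up numbers). P[(s, a)][s'] is p(s' | s, a); R[(s, a, s')] is r(s, a, s').
P = {
    ("s0", "left"):  {"s0": 0.9, "s1": 0.1},
    ("s0", "right"): {"s0": 0.2, "s1": 0.8},
    ("s1", "left"):  {"s0": 0.5, "s1": 0.5},
    ("s1", "right"): {"s1": 1.0},
}
R = {
    ("s0", "left", "s1"): 1.0,
    ("s0", "right", "s1"): 1.0,
    ("s1", "left", "s0"): 2.0,   # transitions not listed give reward 0
}

def p(s_next, s, a):
    """Transition probability p(s' | s, a)."""
    return P[(s, a)].get(s_next, 0.0)

def r(s, a, s_next):
    """Expected next reward r(s, a, s')."""
    return R.get((s, a, s_next), 0.0)

print(p("s1", "s0", "right"))   # 0.8
print(r("s1", "left", "s0"))    # 2.0
```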

3.7 Value Functions

A policy, $\pi$, is a mapping from each state, $s \in \mathcal{S}$, and action, $a \in \mathcal{A}(s)$, to the probability $\pi(a \mid s)$ of taking action $a$ when in state $s$.
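A stochastic policy can be stored directly as such a mapping. The sketch below (an arbitrary example policy of my own) keeps $\pi(a \mid s)$ in a nested dict and samples an action accordingly:

```python
import random

# pi[s][a] = probability of taking action a in state s (an arbitrary example policy).
pi = {
    "s0": {"left": 0.3, "right": 0.7},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(pi, s):
    """Draw an action a with probability pi(a | s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(pi, "s0"))  # "left" with probability 0.3, "right" with 0.7
```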

  • The value of a state $s$ under a policy $\pi$, denoted $v_\pi(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter; this is the state-value function for policy $\pi$:
    $v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s\right]$,
    where $\mathbb{E}_\pi[\cdot]$ denotes the expected value given that the agent follows policy $\pi$, and $t$ is any time step.

  • The value of taking action $a$ in state $s$ under a policy $\pi$, denoted $q_\pi(s, a)$, is the expected return starting from $s$, taking the action $a$, and thereafter following policy $\pi$; this is the action-value function for policy $\pi$:
    $q_\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right]$.

  • Bellman equation for $v_\pi$:
    $v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\left[r(s, a, s') + \gamma v_\pi(s')\right]$.
    It expresses a relationship between the value of a state and the values of its successor states (a short sketch of using it as an update rule follows this list).
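Because the Bellman equation expresses $v_\pi(s)$ in terms of the values of successor states, it can be applied repeatedly as an update rule to compute $v_\pi$ (iterative policy evaluation, developed later in the book). The sketch below is mine; the toy MDP (`P`, `R`) and policy (`pi`) are illustrative assumptions, not from the text.

```python
# Iterative policy evaluation: apply the Bellman equation for v_pi as an
# update rule until the state values stop changing (toy example, my own data).
gamma = 0.9
states = ["s0", "s1"]

P = {  # P[(s, a)][s'] = p(s' | s, a)
    ("s0", "left"):  {"s0": 0.9, "s1": 0.1},
    ("s0", "right"): {"s0": 0.2, "s1": 0.8},
    ("s1", "left"):  {"s0": 0.5, "s1": 0.5},
    ("s1", "right"): {"s1": 1.0},
}
R = {  # r(s, a, s'); transitions not listed give reward 0
    ("s0", "left", "s1"): 1.0,
    ("s0", "right", "s1"): 1.0,
    ("s1", "left", "s0"): 2.0,
}
pi = {  # pi(a | s)
    "s0": {"left": 0.3, "right": 0.7},
    "s1": {"left": 0.5, "right": 0.5},
}

v = {s: 0.0 for s in states}          # initial value estimates
for _ in range(1000):
    delta = 0.0
    for s in states:
        new_v = sum(
            prob_a * sum(p_sp * (R.get((s, a, sp), 0.0) + gamma * v[sp])
                         for sp, p_sp in P[(s, a)].items())
            for a, prob_a in pi[s].items()
        )
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-10:                  # converged
        break

print(v)  # approximate v_pi(s) for each state
```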

Summary of Notation

  • $S_t$, $A_t$, $R_t$: state, action, and reward at time $t$
  • $G_t$: return following time $t$
  • $T$: final time step of an episode
  • $\gamma$: discount rate
  • $\pi(a \mid s)$: probability of taking action $a$ in state $s$ under policy $\pi$
  • $p(s' \mid s, a)$: probability of transition to state $s'$ from state $s$ under action $a$
  • $r(s, a, s')$: expected next reward on the transition from $s$ to $s'$ under action $a$
  • $v_\pi(s)$, $q_\pi(s, a)$: state-value and action-value functions for policy $\pi$
