Reading Notes on Reinforcement Learning: An Introduction, Chapter 3
Chapter 3 The Reinforcement Learning Problem
3.2 Goals and Rewards
The agent's goal in RL:
To maximize not immediate reward, but cumulative reward in the long run.
Reward:
The use of a reward signal to formalize the idea of a goal is one of the most distinctive features of reinforcement learning.
The reward signal is your way of communicating to the robot what you want it to achieve, not how you want it achieved.
3.3 Returns
If the sequence of rewards received after time step t is denoted Rt+1, Rt+2, Rt+3, ..., we seek to maximize the expected return, where the return, Gt, is defined as some specific function of the reward sequence. In the simplest case the return is the sum of the rewards:
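$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T,$$

where T is a final time step.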
- Episodes: when the agent–environment interaction breaks naturally into subsequences (such as plays of a game, trips through a maze, or any sort of repeated interaction).
- Each episode ends in a special state called the terminal state.
- Tasks with episodes are called episodic tasks.
- In episodic tasks we sometimes need to distinguish the set of all nonterminal states, denoted S, from the set of all states plus the terminal state, denoted S+.
discounted return:
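$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},$$

where γ is the discount rate, 0 ≤ γ ≤ 1.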
continuing tasks: the agent–environment interaction does not break naturally into identifiable episodes, but goes on continually without limit.
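Not from the book, but a minimal Python sketch of the discounted return just defined; the reward list and γ value below are made-up numbers for illustration:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    # Accumulate from the last reward backwards, using G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: rewards received after time step t (made-up numbers)
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```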
3.4 Unified Notation for Episodic and Continuing Tasks
St,i is the state representation at time t of episode i (and similarly for At,i, Rt,i, πt,i, Ti, etc.). However, it turns out that when we discuss episodic tasks we will almost never have to distinguish between different episodes, so we will write St to refer to St,i, and so on.
Episode termination can be regarded as entering a special absorbing state that transitions only to itself and generates only rewards of zero.
The unified return can be written as:
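$$G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1},$$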
including the possibility that T = ∞ or γ = 1 (but not both).
3.6 Markov Decision Processes
A reinforcement learning task that satisfies the Markov property is called a Markov decision process, or MDP. If the state and action spaces are finite, then it is called a finite Markov decision process (finite MDP).
transition probabilities: Given any state and action, s and a, the probability of each possible next state, s', is:
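$$p(s' \mid s, a) = \Pr\{S_{t+1} = s' \mid S_t = s,\, A_t = a\}$$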
the expected value of the next reward is:
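$$r(s, a, s') = \mathbb{E}\big[R_{t+1} \mid S_t = s,\, A_t = a,\, S_{t+1} = s'\big]$$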
3.7 Value Functions
- A policy, π, is a mapping from each state, s ∈ S, and action, a ∈ A(s), to the probability π(a|s) of taking action a when in state s.
- The value of a state s under a policy π, denoted vπ(s), is the expected return when starting in s and following π thereafter. This is the state-value function for policy π:
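$$v_\pi(s) = \mathbb{E}_\pi\big[G_t \mid S_t = s\big] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\right]$$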
where Eπ[·] denotes the expected value given that the agent follows policy π, and t is any time step.
- The value of taking action a in state s under a policy π, denoted qπ(s, a), is the expected return starting from s, taking the action a, and thereafter following policy π. This is the action-value function for policy π:
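$$q_\pi(s, a) = \mathbb{E}_\pi\big[G_t \mid S_t = s, A_t = a\big] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a\right]$$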
Bellman equation for vπ
It expresses a relationship between the value of a state and the values of its successor states.
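$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\Big[r(s, a, s') + \gamma\, v_\pi(s')\Big]$$

To make the backup concrete, here is a minimal Python sketch (not from the book) of applying this equation as repeated policy-evaluation sweeps; the dictionary-based MDP representation (`policy`, `trans`, `reward`) and the tiny two-state example are made-up for illustration:

```python
# policy[s][a]           = pi(a|s)
# trans[(s, a)][s_next]  = p(s'|s, a)
# reward[(s, a, s_next)] = r(s, a, s')

def bellman_backup(v, s, policy, trans, reward, gamma=0.9):
    """Apply the Bellman equation for v_pi at a single state s."""
    total = 0.0
    for a, pi_a in policy[s].items():
        for s_next, p in trans[(s, a)].items():
            total += pi_a * p * (reward[(s, a, s_next)] + gamma * v[s_next])
    return total

def evaluation_sweep(v, states, policy, trans, reward, gamma=0.9):
    """One synchronous sweep: back up every state once."""
    return {s: bellman_backup(v, s, policy, trans, reward, gamma) for s in states}

# Tiny made-up MDP: two states, one action.
states = ["s0", "s1"]
policy = {"s0": {"a": 1.0}, "s1": {"a": 1.0}}
trans = {("s0", "a"): {"s1": 1.0}, ("s1", "a"): {"s0": 1.0}}
reward = {("s0", "a", "s1"): 1.0, ("s1", "a", "s0"): 0.0}

v = {s: 0.0 for s in states}
for _ in range(50):  # repeated sweeps converge toward v_pi
    v = evaluation_sweep(v, states, policy, trans, reward, gamma=0.9)
print(v)
```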