Reading Notes on Reinforcement Learning: An Introduction, Chapter 3
Chapter 3 The Reinforcement Learning Problem
3.2 Goals and Rewards
The agent's goal in RL:
To maximize not immediate reward, but cumulative reward in the long run.
Reward:
The use of a reward signal to formalize the idea of a goal is one of the most distinctive features of reinforcement learning.
The reward signal is your way of communicating to the robot what you want it to achieve, not how you want it achieved.
3.3 Returns
If the sequence of rewards received after time step t is denoted Rt+1, Rt+2, Rt+3, ..., we seek to maximize the expected return, where the return, Gt, is defined as some specific function of the reward sequence. In the simplest case the return is the sum of the rewards:
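$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T,$$

where T is a final time step.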
- Episodes: when the agent–environment interaction breaks naturally into subsequences (such as plays of a game, trips through a maze, or any sort of repeated interaction).
- Each episode ends in a special state called the terminal state.
- Tasks with episodes are called episodic tasks.
- In episodic tasks we sometimes need to distinguish the set of all nonterminal states, denoted S, from the set of all states plus the terminal state, denoted S+.
discounted return:
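$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},$$

where γ is the discount rate, 0 ≤ γ ≤ 1.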
continuing tasks: the agent–environment interaction does not break naturally into identifiable episodes, but goes on continually without limit.
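Not from the book, but a minimal Python sketch of the discounted return just defined; the reward list and γ value below are made-up numbers for illustration:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    # Accumulate from the last reward backwards, using G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: rewards received after time step t (made-up numbers)
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```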
3.4 Unified Notation for Episodic and Continuing Tasks
St,i is the state representation at time t of episode i (and similarly for At,i, Rt,i, πt,i, Ti, etc.). However, it turns out that when we discuss episodic tasks we will almost never have to distinguish between different episodes, so we will write St to refer to St,i, and so on.
Episode termination can be regarded as entering a special absorbing state that transitions only to itself and generates only rewards of zero.
The unified return can be written as:
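$$G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1},$$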
including the possibility that T = ∞ or γ = 1 (but not both).
3.6 Markov Decision Processes
A reinforcement learning task that satisfies the Markov property is called a Markov decision process, or MDP. If the state and action spaces are finite, then it is called a finite Markov decision process (finite MDP).
transition probabilities: Given any state and action, s and a, the probability of each possible next state, s', is:
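$$p(s' \mid s, a) = \Pr\{S_{t+1} = s' \mid S_t = s,\, A_t = a\}$$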
the expected value of the next reward is:
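$$r(s, a, s') = \mathbb{E}\big[R_{t+1} \mid S_t = s,\, A_t = a,\, S_{t+1} = s'\big]$$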
3.7 Value Functions
- A policy, π, is a mapping from each state, s ∈ S, and action, a ∈ A(s), to the probability π(a|s) of taking action a when in state s.
- The value of a state s under a policy π, denoted vπ(s), is the expected return when starting in s and following π thereafter. This is the state-value function for policy π:
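$$v_\pi(s) = \mathbb{E}_\pi\big[G_t \mid S_t = s\big] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\right]$$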
where Eπ[·] denotes the expected value given that the agent follows policy π, and t is any time step.
- The value of taking action a in state s under a policy π, denoted qπ(s, a), is the expected return starting from s, taking the action a, and thereafter following policy π. This is the action-value function for policy π:
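$$q_\pi(s, a) = \mathbb{E}_\pi\big[G_t \mid S_t = s, A_t = a\big] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a\right]$$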
Bellman equation for vπ
It expresses a relationship between the value of a state and the values of its successor states.
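$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\Big[r(s, a, s') + \gamma\, v_\pi(s')\Big]$$

To make the backup concrete, here is a minimal Python sketch (not from the book) of applying this equation as repeated policy-evaluation sweeps; the dictionary-based MDP representation (`policy`, `trans`, `reward`) and the tiny two-state example are made-up for illustration:

```python
# policy[s][a]           = pi(a|s)
# trans[(s, a)][s_next]  = p(s'|s, a)
# reward[(s, a, s_next)] = r(s, a, s')

def bellman_backup(v, s, policy, trans, reward, gamma=0.9):
    """Apply the Bellman equation for v_pi at a single state s."""
    total = 0.0
    for a, pi_a in policy[s].items():
        for s_next, p in trans[(s, a)].items():
            total += pi_a * p * (reward[(s, a, s_next)] + gamma * v[s_next])
    return total

def evaluation_sweep(v, states, policy, trans, reward, gamma=0.9):
    """One synchronous sweep: back up every state once."""
    return {s: bellman_backup(v, s, policy, trans, reward, gamma) for s in states}

# Tiny made-up MDP: two states, one action.
states = ["s0", "s1"]
policy = {"s0": {"a": 1.0}, "s1": {"a": 1.0}}
trans = {("s0", "a"): {"s1": 1.0}, ("s1", "a"): {"s0": 1.0}}
reward = {("s0", "a", "s1"): 1.0, ("s1", "a", "s0"): 0.0}

v = {s: 0.0 for s in states}
for _ in range(50):  # repeated sweeps converge toward v_pi
    v = evaluation_sweep(v, states, policy, trans, reward, gamma=0.9)
print(v)
```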