Finite Markov Decision Processes in Reinforcement Learning (RL)

Thanks to Richard S. Sutton and Andrew G. Barto for their great work, Reinforcement Learning: An Introduction (2nd Edition).

Here we summarize some basic notions and formulations that appear in most reinforcement learning problems. This note does not include a detailed explanation of each notion; refer to the reference above for deeper insight.

Markov decision processes (MDPs) are a classical formalization of sequential decision making, where actions influence not just immediate rewards but also subsequent situations, or states, and through those, future rewards. MDPs are a mathematically idealized form of the reinforcement learning problem.

Agent-Environment Interface

[Figure: the agent-environment interaction in a Markov decision process]
MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal. The learner and decision maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment. These interact continually, the agent selecting actions and the environment responding to these actions and presenting new situations (states) to the agent. The environment also gives rise to rewards, special numerical values that the agent seeks to maximize over time through its choice of actions.

More specifically, the agent and environment interact at each of a sequence of discrete time steps, $t = 0, 1, 2, 3, \dots$. At each time step $t$, the agent receives some representation of the environment's state, $S_t \in \mathcal{S}$, and on that basis selects an action, $A_t \in \mathcal{A}(s)$. One time step later, in part as a consequence of its action, the agent receives a numerical reward, $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$, and finds itself in a new state, $S_{t+1}$. The MDP and agent together thereby give rise to a sequence or trajectory that begins like this:

$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$$

In a finite MDP, the sets of states, actions, and rewards all have a finite number of elements. In this case, the random variables $R_t$ and $S_t$ have well-defined discrete probability distributions dependent only on the preceding state and action:

$$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}, \quad \forall s', s \in \mathcal{S},\ r \in \mathcal{R},\ a \in \mathcal{A}(s)$$

$$\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1, \quad \forall s \in \mathcal{S},\ a \in \mathcal{A}(s)$$

One can compute anything else one might want to know about the environment, such as the state-transition probabilities:
$$p(s' \mid s, a) \doteq \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\} = \sum_{r \in \mathcal{R}} p(s', r \mid s, a)$$

We can also compute the expected reward for state-action pairs:
$$r(s, a) \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a)$$
or the expected reward for state-action-next-state triples:
$$r(s, a, s') \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_{r \in \mathcal{R}} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}$$
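
Below is a minimal sketch in Python of how these quantities fit together, using a small hypothetical two-state MDP whose four-argument dynamics $p(s', r \mid s, a)$ are stored as a nested dict. The state names, action names, and numbers are invented purely for illustration.

```python
# Hypothetical two-state, two-action MDP: p[(s, a)] maps (s', r) -> probability.
# The names and numbers below are made up purely for illustration.
p = {
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s0", "go"):   {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "stay"): {("s1", 2.0): 1.0},
    ("s1", "go"):   {("s0", 0.0): 0.6, ("s1", 2.0): 0.4},
}

# Check the normalization constraint: sum over (s', r) of p(s', r | s, a) = 1.
for (s, a), outcomes in p.items():
    assert abs(sum(outcomes.values()) - 1.0) < 1e-12, (s, a)

def state_transition_prob(s, a, s_next):
    """p(s' | s, a) = sum_r p(s', r | s, a)."""
    return sum(prob for (sp, r), prob in p[(s, a)].items() if sp == s_next)

def expected_reward(s, a):
    """r(s, a) = sum_{s', r} r * p(s', r | s, a)."""
    return sum(r * prob for (sp, r), prob in p[(s, a)].items())

def expected_reward_given_next(s, a, s_next):
    """r(s, a, s') = sum_r r * p(s', r | s, a) / p(s' | s, a)."""
    denom = state_transition_prob(s, a, s_next)
    num = sum(r * prob for (sp, r), prob in p[(s, a)].items() if sp == s_next)
    return num / denom if denom > 0 else 0.0

print(state_transition_prob("s0", "go", "s1"))       # 0.8
print(expected_reward("s0", "go"))                   # 0.8
print(expected_reward_given_next("s0", "go", "s1"))  # 1.0
```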

Goals and Rewards

All of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal, called the reward.

The agent always learns to maximize its reward. If we want it to do something for us, we must provide rewards to it in such a way that in maximizing them the agent will also achieve our goals. It is thus critical that the rewards we set up truly indicate what we want accomplished.

Returns and Episodes

We seek to maximize the expected return, where the return, denoted $G_t$, is defined as some specific function of the reward sequence. In the simplest case the return is the sum of the rewards:

$$G_t \doteq R_{t+1} + R_{t+2} + \cdots + R_T$$
where $T$ is a final time step. This approach makes sense in applications in which there is a natural notion of a final time step, that is, when the agent-environment interaction breaks naturally into subsequences, which we call episodes. Each episode ends in a special state called the terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states. Tasks with episodes of this kind are called episodic tasks.

On the other hand, in many cases the agent-environment interaction does not break naturally into identifiable episodes, but goes on continually without limit. We call these continuing tasks. We then introduce an additional concept, discounting. According to this approach, the agent tries to select actions so that the sum of the discounted rewards it receives over the future is maximized. In particular, it chooses $A_t$ to maximize the expected discounted return:

$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} = R_{t+1} + \gamma G_{t+1}$$
where $\gamma$ is a parameter, $0 \le \gamma \le 1$, called the discount rate.
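
As a small illustration, the following sketch computes the discounted returns for a finite, made-up reward sequence by working backwards with the recursion $G_t = R_{t+1} + \gamma G_{t+1}$ and checks the result against the direct sum.

```python
# A minimal sketch: discounted returns for a finite reward sequence, plus a check of
# the recursion G_t = R_{t+1} + gamma * G_{t+1}. The reward values are made up.
rewards = [1.0, 0.0, -2.0, 3.0, 1.0]   # R_1, R_2, ..., R_T
gamma = 0.9

def returns(rewards, gamma):
    """G_t for t = 0, ..., T-1, computed backwards from G_T = 0."""
    G = [0.0] * (len(rewards) + 1)
    for t in reversed(range(len(rewards))):
        G[t] = rewards[t] + gamma * G[t + 1]   # G_t = R_{t+1} + gamma * G_{t+1}
    return G[:-1]

G = returns(rewards, gamma)
# Direct definition for comparison: G_0 = sum_k gamma^k R_{k+1}
G0_direct = sum(gamma**k * r for k, r in enumerate(rewards))
assert abs(G[0] - G0_direct) < 1e-12
print(G)
```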

Policies and Value Functions

Almost all reinforcement learning algorithms involve estimating value functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). The notion of “how good” here is defined in terms of future rewards that can be expected, or, to be precise, in terms of expected return.

A policy is a mapping from states to probabilities of selecting each possible action. If the agent is following policy $\pi$ at time $t$, then $\pi(a \mid s)$ is the probability that $A_t = a$ if $S_t = s$.

The value of a state $s$ under a policy $\pi$, denoted $v_\pi(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter:

$$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right], \quad \forall s \in \mathcal{S}$$
where $\mathbb{E}_\pi[\cdot]$ denotes the expected value of a random variable given that the agent follows policy $\pi$. Note that the value of the terminal state, if any, is always zero. We call the function $v_\pi$ the state-value function for policy $\pi$.

Similarly, we define the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $q_\pi(s, a)$, as the expected return starting from $s$, taking the action $a$, and thereafter following policy $\pi$:

$$q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$$
We call $q_\pi$ the action-value function for policy $\pi$.

The value functions can be estimated from experience. They also satisfy a recursive relationship: for any policy $\pi$ and any state $s$, the following consistency condition holds between the value of $s$ and the values of its possible successor states.

$$\begin{aligned} v_\pi(s) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s] \\ &= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\ &= \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a)\Big[r + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s']\Big] \\ &= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\Big[r + \gamma v_\pi(s')\Big], \quad \forall s \in \mathcal{S} \end{aligned}$$
where it is implicit that the actions $a$ are taken from the set $\mathcal{A}(s)$, that the next states $s'$ are taken from the set $\mathcal{S}$, and that the rewards $r$ are taken from the set $\mathcal{R}$. The equation above is the Bellman equation for $v_\pi$.
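
Because the Bellman equation is linear in $v_\pi$, for a finite MDP it can also be solved exactly with linear algebra rather than estimated from experience. The sketch below (with the same kind of made-up two-state dynamics as earlier, and a uniform-random policy chosen only for illustration) builds $r_\pi(s) = \sum_a \pi(a \mid s)\, r(s,a)$ and $P_\pi(s, s') = \sum_a \pi(a \mid s)\, p(s' \mid s, a)$ and solves $v_\pi = r_\pi + \gamma P_\pi v_\pi$.

```python
import numpy as np

# Made-up two-state, two-action MDP, indexed 0/1; p[s][a] is a list of
# (prob, s_next, reward) triples, as in the four-argument dynamics p(s', r | s, a).
p = [
    [[(1.0, 0, 0.0)],                      # s=0, a=0 ("stay")
     [(0.8, 1, 1.0), (0.2, 0, 0.0)]],      # s=0, a=1 ("go")
    [[(1.0, 1, 2.0)],                      # s=1, a=0
     [(0.6, 0, 0.0), (0.4, 1, 2.0)]],      # s=1, a=1
]
n_states, n_actions, gamma = 2, 2, 0.9
pi = np.full((n_states, n_actions), 0.5)   # uniform-random policy pi(a|s)

# Build r_pi(s) and P_pi(s, s') under the policy.
r_pi = np.zeros(n_states)
P_pi = np.zeros((n_states, n_states))
for s in range(n_states):
    for a in range(n_actions):
        for prob, s_next, r in p[s][a]:
            r_pi[s] += pi[s, a] * prob * r
            P_pi[s, s_next] += pi[s, a] * prob

# Solve (I - gamma * P_pi) v = r_pi, i.e. the Bellman equation for v_pi.
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
print(v_pi)
```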

Solving a reinforcement learning task means, roughly, finding a policy that achieves a lot of reward over the long run. There is always at least one policy that is better than or equal to all other policies, i.e., an optimal policy, denoted $\pi_*$. Although there may be more than one, they share the same state-value function, called the optimal state-value function, denoted $v_*$, and defined as

$$v_*(s) \doteq \max_\pi v_\pi(s), \quad \forall s \in \mathcal{S}$$
Optimal policies also share the same optimal action-value function, denoted $q_*$, and defined as
$$q_*(s, a) \doteq \max_\pi q_\pi(s, a), \quad \forall s \in \mathcal{S},\ a \in \mathcal{A}(s)$$
We can write $q_*$ in terms of $v_*$ as follows:
$$q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a]$$

Because $v_*$ is the value function for a policy, it must satisfy the self-consistency condition given by the Bellman equation. Because it is the optimal value function, however, $v_*$'s consistency condition can be written in a special form without reference to any specific policy. This is the Bellman optimality equation. Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state:

$$\begin{aligned} v_*(s) &= \max_{a \in \mathcal{A}(s)} q_{\pi_*}(s, a) \\ &= \max_a \mathbb{E}_{\pi_*}[G_t \mid S_t = s, A_t = a] \\ &= \max_a \mathbb{E}_{\pi_*}[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\ &= \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a] \\ &= \max_a \sum_{s', r} p(s', r \mid s, a)\Big[r + \gamma v_*(s')\Big] \end{aligned}$$
The last two equations are two forms of the Bellman optimality equation for $v_*$.

The Bellman optimality equation for $q_*$ is

$$\begin{aligned} q_*(s, a) &= \mathbb{E}\!\left[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\middle|\, S_t = s, A_t = a\right] \\ &= \sum_{s', r} p(s', r \mid s, a)\Big[r + \gamma \max_{a'} q_*(s', a')\Big] \end{aligned}$$
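
As a minimal sketch of the relation between action values and state values, the snippet below performs the one-step backup $\sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v(s')\big]$ for an arbitrary value function and reads off the greedy action; both the dynamics and the value estimates fed in are made up for illustration.

```python
# One-step backup q(s, a) = sum_{s', r} p(s', r | s, a) [r + gamma * v(s')]
# and the greedy action argmax_a q(s, a). Dynamics and v are made-up illustrations.
gamma = 0.9
# p[s][a] is a list of (prob, s_next, reward) triples; states are indexed 0 and 1.
p = [
    [[(1.0, 0, 0.0)], [(0.8, 1, 1.0), (0.2, 0, 0.0)]],
    [[(1.0, 1, 2.0)], [(0.6, 0, 0.0), (0.4, 1, 2.0)]],
]

def q_from_v(v, s, a):
    """Back up a value function v through the dynamics for one state-action pair."""
    return sum(prob * (r + gamma * v[s_next]) for prob, s_next, r in p[s][a])

v = [5.0, 12.0]  # some value estimates (e.g., an approximation of v_*)
for s in range(2):
    q = [q_from_v(v, s, a) for a in range(len(p[s]))]
    print(s, q, "greedy action:", max(range(len(q)), key=q.__getitem__))
```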

Dynamic Programming

Assume that the environment is a finite MDP.

Policy Evaluation - Prediction Problem

First we consider how to compute the state-value function $v_\pi$ for an arbitrary policy $\pi$. This is called policy evaluation in the dynamic programming literature. If the environment's dynamics are completely known, then the initial approximation, $v_0$, is chosen arbitrarily (except that the terminal state, if any, must be given value $0$), and each successive approximation is obtained by using the Bellman equation for $v_\pi$ as an update rule:

$$v_{k+1}(s) \doteq \mathbb{E}_\pi[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\Big[r + \gamma v_k(s')\Big]$$
The sequence $\{v_k\}$ can be shown in general to converge to $v_\pi$ as $k \to \infty$ under the same conditions that guarantee the existence of $v_\pi$. This algorithm is called iterative policy evaluation. It replaces the old value of $s$ with a new value obtained from the old values of the successor states of $s$ and the expected immediate rewards, along all the one-step transitions possible under the policy being evaluated. We call this kind of operation an expected update. All the updates done in DP algorithms are called expected updates because they are based on an expectation over all possible next states rather than on a sample next state.
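
A minimal sketch of iterative policy evaluation on a made-up two-state MDP with a uniform-random policy; stopping once the largest change in any state falls below a small threshold theta is an assumption of this sketch, not part of the definition above.

```python
import numpy as np

# Made-up two-state, two-action MDP: p[s][a] lists (prob, s_next, reward) triples.
p = [
    [[(1.0, 0, 0.0)], [(0.8, 1, 1.0), (0.2, 0, 0.0)]],
    [[(1.0, 1, 2.0)], [(0.6, 0, 0.0), (0.4, 1, 2.0)]],
]
gamma, theta = 0.9, 1e-8
pi = np.full((2, 2), 0.5)          # uniform-random policy pi(a|s)
v = np.zeros(2)                    # arbitrary initial approximation v_0

while True:
    delta = 0.0
    for s in range(2):             # one sweep: expected update for every state
        v_new = sum(pi[s, a] * prob * (r + gamma * v[s_next])
                    for a in range(2) for prob, s_next, r in p[s][a])
        delta = max(delta, abs(v_new - v[s]))
        v[s] = v_new               # in-place ("sweeping") update
    if delta < theta:
        break
print(v)                            # approximates v_pi
```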

Policy Improvement

Our reason for computing the value function for a policy is to help find better policies. Suppose we have determined the value function $v_\pi$ for an arbitrary deterministic policy $\pi$. For some state $s$ we would like to know whether or not we should change the policy to deterministically choose an action $a \ne \pi(s)$. We know how good it is to follow the current policy from $s$ (that is, $v_\pi(s)$), but would it be better or worse to change to the new policy? One way to answer this question is to consider selecting $a$ in $s$ and thereafter following the existing policy $\pi$; the value of this way of behaving is $q_\pi(s, a)$.

The key criterion is whether $q_\pi(s, a)$ is greater than or less than $v_\pi(s)$. If it is greater, that is, if it is better to select $a$ once in $s$ and thereafter follow $\pi$ than it would be to follow $\pi$ all the time, then one would expect it to be better still to select $a$ every time $s$ is encountered, and that the new policy would in fact be a better one overall. That this is true is a special case of a general result called the policy improvement theorem.

The process of making a new policy that improves on an original policy, by making it greedy with respect to the value function of the original policy, is called policy improvement.

If the new greedy policy is as good as, but not better than, the old policy $\pi$, then $v_\pi$ satisfies the Bellman optimality equation and $\pi$ must already be an optimal policy:

$$v_\pi(s) = \max_a \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] = \max_a \sum_{s', r} p(s', r \mid s, a)\Big[r + \gamma v_\pi(s')\Big]$$
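
A minimal sketch of one policy improvement step: given (pretend) values $v_\pi$ for a deterministic policy on a made-up two-state MDP, make the policy greedy with respect to $v_\pi$.

```python
# A minimal sketch of policy improvement on a made-up two-state MDP.
# p[s][a] lists (prob, s_next, reward) triples; pi[s] is a deterministic action.
p = [
    [[(1.0, 0, 0.0)], [(0.8, 1, 1.0), (0.2, 0, 0.0)]],
    [[(1.0, 1, 2.0)], [(0.6, 0, 0.0), (0.4, 1, 2.0)]],
]
gamma = 0.9
v_pi = [10.0, 18.0]   # pretend this was obtained by policy evaluation of pi
pi = [0, 0]           # old deterministic policy: pi(s) = action 0 in both states

def q(s, a, v):
    """q_pi(s, a) = sum_{s', r} p(s', r | s, a) [r + gamma * v_pi(s')]."""
    return sum(prob * (r + gamma * v[s_next]) for prob, s_next, r in p[s][a])

# Greedy improvement: pi'(s) = argmax_a q_pi(s, a).
pi_new = [max(range(len(p[s])), key=lambda a: q(s, a, v_pi)) for s in range(len(p))]
print("old policy:", pi, "improved policy:", pi_new)
```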

Policy Iteration

Once a policy, $\pi$, has been improved using $v_\pi$ to yield a better policy, $\pi'$, we can then compute $v_{\pi'}$ and improve it again to yield an even better $\pi''$. We can thus obtain a sequence of monotonically improving policies and value functions:
$$\pi_0 \xrightarrow{\text{E}} v_{\pi_0} \xrightarrow{\text{I}} \pi_1 \xrightarrow{\text{E}} v_{\pi_1} \xrightarrow{\text{I}} \pi_2 \xrightarrow{\text{E}} \cdots \xrightarrow{\text{I}} \pi_* \xrightarrow{\text{E}} v_*$$
where E denotes a policy evaluation step and I denotes a policy improvement step.
This way of finding an optimal policy is called policy iteration.
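
A minimal sketch of the full policy iteration loop, alternating iterative policy evaluation with greedy improvement until the policy no longer changes; the two-state dynamics are the same kind of made-up illustration used earlier.

```python
import numpy as np

# Made-up two-state, two-action MDP: p[s][a] lists (prob, s_next, reward) triples.
p = [
    [[(1.0, 0, 0.0)], [(0.8, 1, 1.0), (0.2, 0, 0.0)]],
    [[(1.0, 1, 2.0)], [(0.6, 0, 0.0), (0.4, 1, 2.0)]],
]
n_states, n_actions, gamma, theta = 2, 2, 0.9, 1e-8

def q(s, a, v):
    return sum(prob * (r + gamma * v[s_next]) for prob, s_next, r in p[s][a])

pi = [0] * n_states                     # arbitrary initial deterministic policy
v = np.zeros(n_states)
while True:
    # Policy evaluation: iterate the Bellman expectation update for the current pi.
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = q(s, pi[s], v)
            delta = max(delta, abs(v_new - v[s]))
            v[s] = v_new
        if delta < theta:
            break
    # Policy improvement: make pi greedy with respect to v.
    pi_new = [max(range(n_actions), key=lambda a: q(s, a, v)) for s in range(n_states)]
    if pi_new == pi:                    # policy stable -> optimal
        break
    pi = pi_new
print("optimal policy:", pi, "v_*:", v)
```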

One drawback to policy iteration is that each of its iterations involves policy evaluation, which may itself be a protracted iterative computation requiring multiple sweeps through the state set. In fact, the policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantees of policy iteration. One important special case is when policy evaluation is stopped after just one sweep (one update of each state). This algorithm is called value iteration. It can be written as a particularly simple update operation that combines the policy improvement and truncated policy evaluation steps:

$$v_{k+1}(s) \doteq \max_a \mathbb{E}[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a] = \max_a \sum_{s', r} p(s', r \mid s, a)\Big[r + \gamma v_k(s')\Big]$$
for all $s \in \mathcal{S}$. For arbitrary $v_0$, the sequence $\{v_k\}$ can be shown to converge to $v_*$ under the same conditions that guarantee the existence of $v_*$.
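
A minimal sketch of value iteration on the same kind of made-up dynamics, applying the combined update above until the values stop changing and then reading off a greedy policy.

```python
import numpy as np

# Made-up two-state, two-action MDP: p[s][a] lists (prob, s_next, reward) triples.
p = [
    [[(1.0, 0, 0.0)], [(0.8, 1, 1.0), (0.2, 0, 0.0)]],
    [[(1.0, 1, 2.0)], [(0.6, 0, 0.0), (0.4, 1, 2.0)]],
]
n_states, n_actions, gamma, theta = 2, 2, 0.9, 1e-8

def q(s, a, v):
    return sum(prob * (r + gamma * v[s_next]) for prob, s_next, r in p[s][a])

v = np.zeros(n_states)                      # arbitrary v_0
while True:
    delta = 0.0
    for s in range(n_states):
        v_new = max(q(s, a, v) for a in range(n_actions))   # Bellman optimality backup
        delta = max(delta, abs(v_new - v[s]))
        v[s] = v_new
    if delta < theta:
        break

# Greedy policy with respect to the (approximately) optimal value function.
pi_star = [max(range(n_actions), key=lambda a: q(s, a, v)) for s in range(n_states)]
print("v_* ~", v, "greedy policy:", pi_star)
```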

We use the term generalized policy iteration (GPI) to refer to the general idea of letting policy evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes. Almost all reinforcement learning methods are well described as GPI. That is, all have identifiable policies and value functions, with the policy always being improved with respect to the value function and the value function always being driven toward the value function for the policy, as suggested by the diagram below.
[Figure: evaluation and improvement interacting in generalized policy iteration until the value function and policy are mutually consistent]
The evaluation and improvement processes in GPI can be viewed as both competing and cooperating. They compete in the sense that they pull in opposing directions. Making the policy greedy with respect to the value function typically makes the value function incorrect for the changed policy, and making the value function consistent with the policy typically causes that policy no longer to be greedy.

DP may not be practical for very large problems, but compared with other methods for solving MDPs, DP methods are actually quite efficient.

Convergence Proof

Here we give a proof of the convergence of the policy evaluation process. The proof is based on the contraction mapping and fixed-point principle, but we do not discuss the mathematical basics.

  • Definition: Let $X$ be a metric space with metric $\rho$ and let $T: X \to X$ be a mapping. If there exists $a$, $0 \le a < 1$, satisfying $\rho(Tx, Ty) \le a\,\rho(x, y)$ for all $x, y \in X$, then $T$ is a contraction mapping on the space $X$.
  • Definition: If there exists $x_0 \in X$ satisfying $T x_0 = x_0$, then $x_0$ is a fixed point of $T$.
  • Theorem (Banach fixed-point theorem): A contraction mapping on a complete metric space has exactly one fixed point, and the iteration $x_{k+1} = T x_k$ converges to that fixed point from any starting point.

To prove that some iteration sequence is convergent, we therefore only have to prove that the corresponding mapping is a contraction mapping.

The Bellman equation for $v_\pi$,

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\Big[r + \gamma v_\pi(s')\Big],$$

says that $v_\pi$ is a fixed point of the Bellman operator $T_\pi$ defined by $(T_\pi v)(s) \doteq \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v(s')\big]$, and iterative policy evaluation is exactly the iteration $v_{k+1} = T_\pi v_k$. Take the sup-norm (infinity norm) as the metric:
$$\|v\|_\infty = \max_{s \in \mathcal{S}} |v(s)|$$
Then
$$\begin{aligned}
\|T_\pi(u) - T_\pi(v)\|_\infty &= \max_s \left| \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\Big[r + \gamma u(s')\Big] - \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\Big[r + \gamma v(s')\Big] \right| \\
&= \max_s \left| \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, \gamma \big(u(s') - v(s')\big) \right| \\
&\le \gamma \max_s \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, \big|u(s') - v(s')\big| \\
&\le \gamma \max_{s'} \big|u(s') - v(s')\big| \\
&= \gamma\, \|u - v\|_\infty
\end{aligned}$$
Hence $T_\pi$ is a $\gamma$-contraction in the sup-norm whenever $\gamma < 1$, and by the theorem above the iterative policy evaluation sequence $v_{k+1} = T_\pi v_k$ converges to the unique fixed point $v_\pi$. Done.
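
As a numerical sanity check of this contraction property, the sketch below generates a random finite MDP and a random policy (all numbers synthetic, for illustration only) and verifies that $\|T_\pi u - T_\pi v\|_\infty \le \gamma \|u - v\|_\infty$ for random value vectors $u$ and $v$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Synthetic dynamics: P[s, a, s'] = p(s' | s, a), R[s, a] = r(s, a).
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)          # normalize so each (s, a) row sums to 1
R = rng.normal(size=(n_states, n_actions))
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)        # random policy pi(a|s)

def T_pi(v):
    """Bellman operator written with p(s'|s,a) and r(s,a), which is equivalent:
    (T_pi v)(s) = sum_a pi(a|s) [ r(s,a) + gamma * sum_s' p(s'|s,a) v(s') ]."""
    return (pi * (R + gamma * P @ v)).sum(axis=1)

for _ in range(5):
    u, v = rng.normal(size=n_states), rng.normal(size=n_states)
    lhs = np.max(np.abs(T_pi(u) - T_pi(v)))
    rhs = gamma * np.max(np.abs(u - v))
    assert lhs <= rhs + 1e-12
    print(f"||T_pi u - T_pi v||_inf = {lhs:.4f} <= gamma * ||u - v||_inf = {rhs:.4f}")
```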
