Finite Markov Decision Processes

A Markov process is a memoryless random process, i.e. a sequence of random states $S_1, S_2, \dots$ with the Markov property.

  • Definition
    A Markov Process (or Markov Chain) is a tuple $(\mathcal{S}, \mathcal{P})$:
    • $\mathcal{S}$ is a (finite) set of states.
    • $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \Pr[S_{t+1} = s' \mid S_t = s]$.
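To make the definition concrete, here is a small sketch (my own example, not from the source) that simulates a chain by repeatedly sampling the next state from the row of $\mathcal{P}$ for the current state:

```python
import numpy as np

# Hypothetical 3-state Markov chain: row s of P gives Pr[S_{t+1} = s' | S_t = s].
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.2, 0.0, 0.8]])

def sample_chain(P, start, steps, seed=0):
    """Sample a trajectory S_0, S_1, ..., S_steps from the chain."""
    rng = np.random.default_rng(seed)
    states = [start]
    for _ in range(steps):
        # Markov property: the next state depends only on the current state.
        states.append(int(rng.choice(len(P), p=P[states[-1]])))
    return states

print(sample_chain(P, start=0, steps=10))
```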

Markov Decision Processes are a classical formalization of sequential decision making. MDPs are a mathematically idealized form of the reinforcement learning problem for which precise theoretical statements can be made. As in all of artificial intelligence, there is a tension between breadth of applicability and mathematical tractability.

An MDP formally describes an environment for RL in which the environment is fully observable.

1. The Agent-Environment Interface

The learner and decision maker is called the agent.
The thing it interacts with, comprising everything outside the agent, is called the environment.
(We use the terms agent, environment, and action instead of the engineers’ terms controller, controlled system (or plant), and control signal because they are meaningful to a wider audience.)

In particular, the boundary between agent and environment is typically not the same as the physical boundary of a robot’s or animal’s body. The general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment.



Figure 1: The agent-environment interaction in a Markov decision process.

The probabilities given by the four-argument function $p$ completely characterize the dynamics of a finite MDP:

$$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$

From it, we can compute anything else one might want to know about the environment.
For example (written out in the formulas just after this list):

  • the state-transition probabilities
  • the expected rewards for state-action pairs
  • the expected rewards for state-action-next-state triples
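Concretely, each of these follows from the four-argument $p$ (these are the standard definitions, written in the same notation as the dynamics function above):

$$p(s' \mid s, a) \doteq \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\} = \sum_{r} p(s', r \mid s, a)$$

$$r(s, a) \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r} r \sum_{s'} p(s', r \mid s, a)$$

$$r(s, a, s') \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_{r} r\, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}$$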

The MDP framework is a considerable abstraction of the problem of goal-directed learning from interaction. Any problem of learning goal-directed behavior can be reduced to three signals passing back and forth between an agent and its environment: one signal to represent the choices made by the agent (the actions), one signal to represent the basis on which the choices are made (the states), and one signal to define the agent’s goal (the rewards).

2. Goals and Rewards

All of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called the reward).

It is thus critical that the rewards we set up truly indicate what we want accomplished. In particular, the reward signal is not the place to impart to the agent prior knowledge about how to achieve what we want it to do.

3. Returns and Episodes

In general, we seek to maximize the expected return, where the return, denoted $G_t$, is defined as some specific function of the reward sequence.

$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

The discount rate determines the present value of future rewards: a reward received $k$ time steps in the future is worth only $\gamma^{k-1}$ times what it would be worth if it were received immediately.
If $\gamma < 1$, the infinite sum above has a finite value as long as the reward sequence $\{R_k\}$ is bounded.
If $\gamma = 0$, the agent is “myopic” in being concerned only with maximizing immediate rewards.

Returns at successive time steps are related to each other in a way that is important for the theory and algorithms of reinforcement learning:

$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots \right) = R_{t+1} + \gamma G_{t+1}$$
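This recursion also gives a simple backward pass for computing every return of a finite episode from its reward sequence. A minimal sketch (my own example, not from the text):

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute G_t for every step of a finite episode via G_t = R_{t+1} + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0  # the return after the terminal step is zero
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g   # rewards[t] plays the role of R_{t+1}
        returns[t] = g
    return returns

print(discounted_returns([1.0, 0.0, 0.0, 5.0], gamma=0.9))  # ~[4.645, 4.05, 4.5, 5.0]
```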

5. Policies and Value Functions

Almost all reinforcement learning algorithms involve estimating value functions: functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state).

Formally, a policy is a mapping from states to probabilities of selecting each possible action.

$$\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)$$
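For a finite MDP, such a policy can be stored as a table. A minimal sketch (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical tabular policy for 3 states and 2 actions:
# pi[s, a] = pi(a | s); each row sums to 1.
pi = np.array([[0.8, 0.2],
               [0.5, 0.5],
               [0.1, 0.9]])

rng = np.random.default_rng(0)
state = 1
action = rng.choice(pi.shape[1], p=pi[state])  # sample A_t ~ pi(. | S_t = state)
```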

A value function is a prediction of future reward, used to evaluate the goodness or badness of states.

The value of a state $s$ under a policy $\pi$, denoted $v_\pi(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter.
The state-value function for policy $\pi$:

$$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$$

Similarly, we define the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $q_\pi(s, a)$, as the expected return starting from $s$, taking the action $a$, and thereafter following policy $\pi$.
The action-value function for policy $\pi$:

$$q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$$
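Both definitions can be approximated directly by averaging sampled returns. A minimal Monte Carlo sketch for $v_\pi(s)$ (my own code, assuming a hypothetical env object with reset(state) -> state and step(action) -> (next_state, reward, done)):

```python
import numpy as np

def mc_state_value(env, pi, s, gamma=0.9, episodes=1000, seed=0):
    """Monte Carlo estimate of v_pi(s): average the returns of episodes
    that start in state s and follow the tabular policy pi thereafter."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(episodes):
        state, done, g, discount = env.reset(s), False, 0.0, 1.0
        while not done:
            a = rng.choice(len(pi[state]), p=pi[state])  # A_t ~ pi(. | S_t)
            state, r, done = env.step(a)                 # hypothetical interface
            g += discount * r                            # add gamma^k * R_{t+k+1}
            discount *= gamma
        total += g
    return total / episodes
```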

The relationship between the state-value function and the action-value function:

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a)$$

For any policy $\pi$ and any state $s$, the following consistency condition holds between the value of $s$ and the value of its possible successor states:

$$\begin{aligned}
v_\pi(s) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] \\
&= \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a) \Big[ r + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s'] \Big] \\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma\, v_\pi(s') \big]
\end{aligned}$$

The equation above is the Bellman equation for $v_\pi$. It expresses a relationship between the value of a state and the values of its successor states.
The Bellman equation forms the basis of a number of ways to compute, approximate, and learn $v_\pi$. The existence and uniqueness of $v_\pi$ are guaranteed as long as either $\gamma < 1$ or eventual termination is guaranteed from all states under the policy $\pi$.
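For example, the Bellman equation can be turned directly into an iterative update. A minimal sketch of iterative policy evaluation (my own code; it assumes the dynamics are available as a dictionary p[(s, a)] of (next_state, reward, probability) triples and that pi[s][a] gives $\pi(a \mid s)$; these names are illustrative, not from the text):

```python
def policy_evaluation(states, actions, p, pi, gamma=0.9, tol=1e-8):
    """Repeatedly apply the Bellman equation for v_pi until the values stop changing."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # v(s) <- sum_a pi(a|s) sum_{s', r} p(s', r | s, a) [r + gamma * v(s')]
            new_v = sum(
                pi[s][a] * sum(prob * (r + gamma * v[s2]) for s2, r, prob in p[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v
```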

A fundamental property of value functions used throughout RL is that they satisfy recursive relationships similar to the one we have already established for the return.

6. Optimal Policies and Optimal Value Functions

Solving a reinforcement learning task means, roughly, finding a policy that achieves a lot of reward over the long run.
There is always at least one policy that is better than or equal to all other policies; this is called an optimal policy, denoted $\pi_*$.
The definition of “better”: $\pi \geq \pi'$ if and only if $v_\pi(s) \geq v_{\pi'}(s)$ for all $s \in \mathcal{S}$.
All optimal policies share the same state-value function, called the optimal state-value function, denoted $v_*$, and defined as

$$v_*(s) \doteq \max_\pi v_\pi(s), \quad \text{for all } s \in \mathcal{S}$$

Optimal policies also share the same optimal action-value function, denoted $q_*$, and defined as

$$q_*(s, a) \doteq \max_\pi q_\pi(s, a), \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}$$

We can write $q_*$ in terms of $v_*$ as follows:

$$q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a]$$

The Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state:

$$\begin{aligned}
v_*(s) &= \max_{a \in \mathcal{A}(s)} q_{\pi_*}(s, a) \\
&= \max_a \mathbb{E}_{\pi_*}[G_t \mid S_t = s, A_t = a] \\
&= \max_a \mathbb{E}_{\pi_*}[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\
&= \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a] \\
&= \max_a \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_*(s') \big]
\end{aligned}$$

The Bellman optimality equation for $q_*$ is

$$q_*(s, a) = \mathbb{E}\!\left[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\middle|\, S_t = s, A_t = a\right] = \sum_{s', r} p(s', r \mid s, a) \Big[ r + \gamma \max_{a'} q_*(s', a') \Big]$$
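Likewise, the Bellman optimality equation can be turned into an update rule; this is value iteration. A minimal sketch (same hypothetical p[(s, a)] encoding as the policy-evaluation sketch above), after which acting greedily with respect to the computed values gives an optimal policy:

```python
def value_iteration(states, actions, p, gamma=0.9, tol=1e-8):
    """Apply v(s) <- max_a sum_{s', r} p(s', r | s, a) [r + gamma * v(s')] until convergence."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(
                sum(prob * (r + gamma * v[s2]) for s2, r, prob in p[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            break
    # Greedy policy with respect to v_* is optimal.
    pi_star = {s: max(actions, key=lambda a: sum(prob * (r + gamma * v[s2])
                                                 for s2, r, prob in p[(s, a)]))
               for s in states}
    return v, pi_star
```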
