2.2 DP: Value Iteration & Gambler's Problem

Contents

Value Iteration

Background

Definition

Pseudocode

Gambler's Problem

Case

Case analysis

Code

Result


Value Iteration

Background

In policy iteration, the policy is improved only after the value function has converged during policy evaluation. In fact, full convergence is not necessary: even if the last several sweeps of policy evaluation are dropped, policy improvement still converges to the same result. Therefore we do not need to run policy evaluation to convergence before improving the policy.

Definition

We can run just a few sweeps of policy evaluation, switch to policy improvement, and then return to policy evaluation, repeating until the policy converges to the optimal policy.

In the extreme case, we perform policy improvement right after a single update of a single state; this is value iteration, whose backup is written out below.
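Combining one truncated sweep of evaluation with the improvement step gives the value iteration backup, in the same notation used in the pseudocode below:

v_{k+1}(s) = \max_a \sum_{s',r} p(s',r|s,a)\,[\,r + \gamma v_k(s')\,]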

Pseudocode

while judge == True:
        judge = False
        for s in S:
                v_old = v(s)
                v(s) = \max_a \sum_{s',r} p(s',r|s,a) [r + \gamma v(s')]
                if | v(s) - v_old | > \theta:
                        judge = True

# optimal policy
\pi(s) = argmax_a \sum_{s',r} p(s',r|s,a) [r + \gamma v(s')]
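As a minimal Python sketch of this pseudocode (not from the original post), assuming the dynamics are supplied as a nested dictionary P[s][a] holding (prob, next_state, reward) tuples, a hypothetical format chosen only for illustration:

import numpy as np

def value_iteration(P, n_states, gamma=1.0, theta=1e-4):
    # assumed format: P[s][a] = list of (prob, next_state, reward) tuples;
    # terminal states should expose at least a self-loop action with reward 0
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_old = v[s]
            # Bellman optimality backup: best one-step lookahead value
            v[s] = max(
                sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v[s] - v_old))
        if delta < theta:
            break
    # greedy policy extracted from the converged value function
    policy = {
        s: max(P[s], key=lambda a: sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]))
        for s in range(n_states)
    }
    return v, policy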

Gambler's Problem

Case

A gambler repeatedly bets on coin flips: with probability p_h the flip comes up heads and he wins as many dollars as he staked on that flip, otherwise he loses the stake. The game ends when his capital reaches 100 (the goal) or drops to 0.

Case analysis

S: the money we have, 0-100, with v(0) = 0 and v(100) = 1 held fixed

A: the money we stake, 0-min(s, 100-s)

R: no intermediate reward; the only reward is reaching 100, encoded here by fixing v(100) = 1

S': s+a (win) or s-a (lose), depending on the probability of winning a flip, p_h
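Since gamma = 1 and the only reward is folded into the fixed boundary values, the action value maximised in each sweep reduces to

Q(s,a) = p_h \, v(s+a) + (1 - p_h) \, v(s-a)

which is exactly the quantity computed for every feasible stake a in the code below.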

Code

### settings


import math
import numpy
import random

# visualization 
import matplotlib 
import matplotlib.pyplot as plt

# global settings

MAX_MONEY = 100
MIN_MONEY = 0

P_H = 0.4        # probability of winning a single flip

gamma = 1        # episodic, undiscounted

# accuracy threshold of the value function
error = 0.001

### functions

# best action value: max_a Q(s,a) for state s
def max_v_a(v, s):
    MIN_ACTION = 0
    MAX_ACTION = min(s, MAX_MONEY - s)

    v_a = numpy.zeros(MAX_ACTION + 1 - MIN_ACTION, dtype=float)

    for a in range(MIN_ACTION, MAX_ACTION + 1):
        # Q(s,a) = p_h * v(s+a) + (1 - p_h) * v(s-a); rewards are folded into v(100) = 1
        v_a[a - MIN_ACTION] = (P_H * (0 + gamma * v[s + a]) +
                               (1 - P_H) * (0 + gamma * v[s - a]))

    return max(v_a)

# greedy action: argmax_a Q(s,a) for state s
def argmax_a(v, s):
    MIN_ACTION = 0
    MAX_ACTION = min(s, MAX_MONEY - s)

    v_a = numpy.zeros(MAX_ACTION + 1 - MIN_ACTION, dtype=float)

    for a in range(MIN_ACTION, MAX_ACTION + 1):
        v_a[a - MIN_ACTION] = (P_H * (0 + gamma * v[s + a]) +
                               (1 - P_H) * (0 + gamma * v[s - a]))

    # numpy.argmax breaks ties toward the smallest stake
    return numpy.argmax(v_a)

# visualization 

def visualization(v_set, policy):

    fig, axes = plt.subplots(2, 1)
    # value function after each sweep
    for i in range(0, len(v_set)):
        axes[0].plot(v_set[i], linewidth=3)
    axes[0].set_title('value function')

    # final greedy policy for states 1..99
    axes[1].plot(range(1, len(policy) + 1), policy)
    axes[1].set_title('policy')
    plt.show()

### main programming

# greedy policy for the non-terminal states 1..99
policy = numpy.zeros(MAX_MONEY - 1)

# value function over states 0..100
v = numpy.zeros(MAX_MONEY + 1, dtype=float)

# value function recorded after every sweep
v_set = []

# initialization: reaching 100 is worth 1, ruin is worth 0
v[MAX_MONEY] = 1
v[MIN_MONEY] = 0

judge = True    # stays True while some state still changes by more than `error`


# value iteration

while judge:
    judge = False
    for s in range(MIN_MONEY + 1, MAX_MONEY):
        v_old = v[s]
        v[s] = max_v_a(v, s)    # Bellman optimality backup, in place
        if math.fabs(v[s] - v_old) > error:
            judge = True
    v_set.append(v.copy())

# optimal policy

for s in range(MIN_MONEY + 1, MAX_MONEY):
    policy[s - 1] = argmax_a(v, s)    # greedy stake for each non-terminal state


# visualization

visualization(v_set,policy)

Result

For p_h = 0.4, the script plots the value function after each sweep (top panel) and the final greedy policy (bottom panel).
