2.2 DP: Value Iteration & Gambler's Problem

Latest recommended article published on 2021-08-16 14:26:10

最適当承诺


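Category: Reinforcement Learning Study Notes. Tags: dynamic programming, value iteration, gambler's problem, reinforcement learning, optimal policy

Article link: https://blog.csdn.net/upr_rom/article/details/118806508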


Value Iteration

Background

In policy iteration, the policy is improved only after policy evaluation has converged to the value function of the current policy. That convergence is not actually required: the later sweeps of policy evaluation often change the values without changing the greedy policy, so stopping evaluation early can lead to the same result. We therefore do not need to run policy evaluation to convergence.

Definition

We can run only a few sweeps of policy evaluation, switch to policy improvement, then return to policy evaluation, and repeat until the policy converges to the optimal policy.

In the extreme case, policy improvement follows a single update of each state: the maximization over actions is folded directly into the value update. This is value iteration.
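As a rough illustration of the alternation described above, the sketch below runs truncated policy iteration on a hypothetical 5-state chain. The chain itself, the reward of -1 per move, the discount of 0.9, and the names `step` and `truncated_policy_iteration` are all assumptions made for this example and do not come from the original notes.

```python
import numpy as np

# Hypothetical 5-state chain (an assumption for illustration):
# states 0..4, state 4 is terminal, actions 0 = left and 1 = right,
# every move costs -1.
N_STATES, TERMINAL, GAMMA = 5, 4, 0.9

def step(s, a):
    """Deterministic transition: returns (next_state, reward)."""
    s_next = max(s - 1, 0) if a == 0 else min(s + 1, TERMINAL)
    return s_next, -1.0

def truncated_policy_iteration(k_sweeps, max_rounds=100):
    """Alternate k_sweeps of policy evaluation with one greedy improvement."""
    v = np.zeros(N_STATES)
    policy = np.zeros(N_STATES, dtype=int)  # start with "always left"
    for _ in range(max_rounds):
        # truncated policy evaluation: only k_sweeps in-place sweeps
        for _ in range(k_sweeps):
            for s in range(TERMINAL):
                s_next, r = step(s, policy[s])
                v[s] = r + GAMMA * v[s_next]
        # policy improvement: act greedily with respect to the current v
        new_policy = policy.copy()
        for s in range(TERMINAL):
            q = []
            for a in (0, 1):
                s_next, r = step(s, a)
                q.append(r + GAMMA * v[s_next])
            new_policy[s] = int(np.argmax(q))
        if np.array_equal(new_policy, policy):
            break  # the greedy policy stopped changing
        policy = new_policy
    return policy

print(truncated_policy_iteration(k_sweeps=1))
print(truncated_policy_iteration(k_sweeps=10))
```

On this small chain, one sweep per round and ten sweeps per round end with the same greedy policy (move right in every non-terminal state), which matches the observation in the Background section. Stopping when the policy stops changing is a simplification that happens to work here; it is not a general convergence test.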

Pseudocode

judge = True

while judge == True:

    judge = False

    for $s \in S$:

        $v_{old} = v(s)$

        $v(s) = \max_a \sum_{s^{'},r} p(s^{'},r|s,a)[r+\gamma v(s^{'})]$

        if $|v(s) - v_{old}| > \theta$:

            judge = True

# optimal policy

$\pi(s) = \arg\max_a \sum_{s^{'},r} p(s^{'},r|s,a)[r+\gamma v(s^{'})]$
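The pseudocode above can be sketched in Python for a generic tabular MDP. The transition format `P[s][a] = [(probability, next_state, reward), ...]`, the function name `value_iteration`, and the three-state example MDP are assumptions chosen for this illustration; a state with no actions is treated as terminal with value 0.

```python
def value_iteration(P, gamma=0.9, theta=1e-8):
    """Tabular value iteration with in-place updates.

    P[s][a] is a list of (probability, next_state, reward) triples
    (a hypothetical format chosen for this sketch).
    """
    v = {s: 0.0 for s in P}

    def backup(s, a):
        # expected one-step return of taking action a in state s
        return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])

    judge = True
    while judge:
        judge = False
        for s in P:
            if not P[s]:  # terminal state: no actions, value stays 0
                continue
            v_old = v[s]
            v[s] = max(backup(s, a) for a in P[s])
            if abs(v[s] - v_old) > theta:
                judge = True

    # optimal policy: greedy with respect to the converged values
    policy = {s: max(P[s], key=lambda a: backup(s, a)) for s in P if P[s]}
    return v, policy

# Hypothetical example MDP: a safe reward now, or a gamble on a larger one.
P = {
    "s0": {"safe": [(1.0, "done", 1.0)],
           "risky": [(0.5, "s1", 0.0), (0.5, "done", 0.0)]},
    "s1": {"finish": [(1.0, "done", 4.0)]},
    "done": {},
}
v, policy = value_iteration(P)
print(v, policy)
```

In this example the risky action is worth 0.5 * 0.9 * 4 = 1.8 in `s0`, which beats the safe reward of 1, so the greedy policy picks `risky`.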

Gambler's Problem

Case

Case analysis

S: the capital we hold, 0 to 100, with fixed terminal values v(0) = 0 and v(100) = 1

A: the stake, 0 to min(s, 100 - s)

R: no intermediate reward; winning is encoded by the fixed value v(100) = 1

$S^{'}$: s + a with probability p_h (the bet is won), or s - a with probability 1 - p_h (the bet is lost)
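Under the modeling choices listed above (zero reward on every transition, $\gamma = 1$, terminal values held fixed), the value iteration update should reduce to the following form. This is a restatement of the general update for this problem, written out here as a reading aid:

$$v(s) \leftarrow \max_{0 \le a \le \min(s,\,100-s)} \big[\, p_h\, v(s+a) + (1-p_h)\, v(s-a) \,\big], \qquad v(0)=0,\ v(100)=1$$

For example, staking $a = 50$ from $s = 50$ gives $p_h \cdot v(100) + (1-p_h) \cdot v(0) = p_h$, so $v(50) \ge p_h$ after the first update of that state.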

Code

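### settings

import math
import numpy

# visualization
import matplotlib.pyplot as plt

# global settings
MAX_MONEY = 100  # goal capital; reaching it ends the episode with a win
MIN_MONEY = 0    # ruin; reaching it ends the episode with a loss
P_H = 0.4        # probability that a bet is won
gamma = 1        # no discounting: the task is episodic

# convergence threshold for the value function (theta in the pseudocode)
error = 0.001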

### functions

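# max value function: best expected value over all stakes in state s
def max_v_a(v,s):
	MIN_ACTION = 0
	MAX_ACTION = min( s, MAX_MONEY-s )

	# float replaces numpy.float, which was removed in NumPy 1.24
	v_a = numpy.zeros(MAX_ACTION+1 - MIN_ACTION ,dtype = float)

	for a in range(MIN_ACTION, MAX_ACTION+1):
		# win with probability P_H (capital s+a), lose otherwise (capital s-a);
		# every transition reward is 0
		v_a [a-MIN_ACTION] = ( P_H*( 0 + gamma*v[s+a] ) + \
		(1- P_H)*( 0 + gamma*v[s-a] ) )

	return max(v_a)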

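# max value function index: the stake that attains the best expected value
def argmax_a(v,s):
	MIN_ACTION = 0
	MAX_ACTION = min( s, MAX_MONEY-s )

	# float replaces numpy.float, which was removed in NumPy 1.24
	v_a = numpy.zeros(MAX_ACTION+1 - MIN_ACTION ,dtype = float)

	for a in range(MIN_ACTION, MAX_ACTION+1):
		v_a [a-MIN_ACTION] = ( P_H*( 0 + gamma*v[s+a] ) + \
		(1- P_H)*( 0 + gamma*v[s-a] ) )

	# numpy.argmax returns the first maximum, so ties go to the smallest stake;
	# MIN_ACTION is 0, so the index equals the stake
	return ( numpy.argmax(v_a) )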

# visualization: value function after each sweep (top) and final policy (bottom)
def visualization(v_set,policy):

	fig,axes = plt.subplots(2,1)
	for i in range(0,len(v_set)):
		axes[0].plot(v_set[i],linewidth=3)
	#	plt.pause(0.5)
	axes[0].set_title('value function')

	# policy[0] belongs to capital 1, so the x-axis starts at 1
	axes[1].plot(range(1,len(policy)+1),policy)
	axes[1].set_title('policy')
	plt.show()

### main program

# policy: policy[s-1] is the stake chosen with capital s, for s = 1..MAX_MONEY-1
policy = numpy.zeros(MAX_MONEY-1)

# value function
v = numpy.zeros(MAX_MONEY+1,dtype = float)

# snapshot of the value function after every sweep
v_set = []

# initialization: the terminal values stay fixed
v[MAX_MONEY] = 1
v[MIN_MONEY] = 0

judge = True

# value iteration: sweep until no state changes by more than error
while judge:
	judge = False
	for s in range(MIN_MONEY+1,MAX_MONEY):
		v_old = v[s]
		v[s] = max_v_a(v,s)
		if math.fabs( v[s] - v_old ) > error:
			judge = True
	v_set.append(v.copy())

# optimal policy: greedy with respect to the converged values
for s in range(MIN_MONEY+1,MAX_MONEY):
	policy[s-1] = argmax_a(v,s)

# visualization
visualization(v_set,policy)
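Because many stakes can have nearly identical values, a greedy policy read off with a plain argmax may look noisy, and a stake of 0 always ties with the current value of the state. One common workaround, which is a variation and not the code above, is to skip the zero stake and round the action values before taking the argmax. The sketch below shows that variation without plotting; the names `action_values` and `solve_gambler`, the tighter threshold `THETA`, and the rounding to 5 decimals are assumptions made for this example.

```python
import numpy as np

GOAL, P_H, THETA = 100, 0.4, 1e-9

def action_values(v, s):
    """Expected value of each stake 1..min(s, GOAL - s) in state s."""
    stakes = np.arange(1, min(s, GOAL - s) + 1)
    return stakes, P_H * v[s + stakes] + (1 - P_H) * v[s - stakes]

def solve_gambler():
    v = np.zeros(GOAL + 1)
    v[GOAL] = 1.0  # winning is encoded as a fixed terminal value
    while True:
        delta = 0.0
        for s in range(1, GOAL):
            _, q = action_values(v, s)
            best = q.max()
            delta = max(delta, abs(best - v[s]))
            v[s] = best  # in-place update, as in the code above
        if delta < THETA:
            break
    policy = np.zeros(GOAL + 1, dtype=int)
    for s in range(1, GOAL):
        stakes, q = action_values(v, s)
        # rounding merges values that differ only by floating-point noise,
        # so argmax then returns the smallest of the tied stakes
        policy[s] = stakes[np.argmax(np.round(q, 5))]
    return v, policy

v, policy = solve_gambler()
print(v[50], policy[50])
```

Dropping the zero stake does not change the values here, since betting nothing can never improve on the best positive stake, but it keeps the argmax from settling on "do nothing" through a tie.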

Result

For p_h = 0.4
