Batch manufacturing problem
Exercise 7.8 from *Dynamic Programming and Optimal Control*
- At each time period, a manufacturer receives an order for her product with probability p and no order with probability 1 - p.
- At any period, she can either process all unfilled orders in a batch or process none of them. At most n orders can remain unfilled.
- The cost per unfilled order at each time period is c > 0, and the setup cost to process the unfilled orders is K > 0.
- The manufacturer wants a processing policy that minimizes the total expected cost with discount factor α < 1.
Implement both the value iteration and policy iteration algorithms, with
c = 1, K = 5, n = 10, p = 0.5, α = 0.9
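To make the model concrete before the algorithms, here is a small sketch of the state space, stage cost, and transitions; the function names and structure are my own, not part of the exercise:

```python
# State i = number of unfilled orders, i = 0, ..., n.
# Action 1 = process the batch (setup cost K, orders reset to 0);
# action 0 = wait (holding cost c per unfilled order).
# In either case a new order then arrives with probability p.

def stage_cost(i, action, c=1, K=5):
    """Immediate cost g(i, u)."""
    return K if action == 1 else c * i

def successors(i, action, p=0.5):
    """(next_state, probability) pairs after taking `action` in state i."""
    base = 0 if action == 1 else i  # waiting at i = n is not allowed
    return [(base, 1 - p), (base + 1, p)]
```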
Value iteration
Algorithm overview
$$J_{k+1}(i)=\min_{u \in U(i)}\left[g(i,u)+\alpha\sum_{j=0}^{n} p_{ij}(u)\,J_{k}(j)\right],\quad \forall i$$
for any initial condition, for instance $J_0(i)=0$.
It is guaranteed that $\lim_{k \rightarrow \infty} J_k(i)=J^{*}(i)$.
Implementation
import numpy as np

def value_iteration(c, K, n, p, alpha=0.9):
    # states 0..n = number of unfilled orders
    Jk = np.zeros(n + 1)
    threshold = 1e-10
    for k in range(1000):
        Jkp1 = np.empty(n + 1)
        for i in range(n):
            # process the whole batch: pay K, orders reset to 0,
            # then a new order arrives with probability p
            pro = K + alpha * ((1 - p) * Jk[0] + p * Jk[1])
            # leave the orders unfilled: pay c per unfilled order
            unpro = c * i + alpha * ((1 - p) * Jk[i] + p * Jk[i + 1])
            Jkp1[i] = min(pro, unpro)
        # at state n the batch must be processed
        Jkp1[n] = K + alpha * ((1 - p) * Jk[0] + p * Jk[1])
        # compare the whole vectors, not just the last entry
        converged = np.max(np.fabs(Jkp1 - Jk)) < threshold
        Jk = Jkp1
        if converged:
            break
    print("the result of value_iteration is: ")
    print(Jk)
    return Jk
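The same Bellman update can also be written in vectorized form, which makes a handy cross-check; this is a self-contained sketch under the same state convention (states 0..n, state n forced to process), with names of my own choosing:

```python
import numpy as np

def vi(c=1.0, K=5.0, n=10, p=0.5, alpha=0.9, tol=1e-10, max_iter=10000):
    J = np.zeros(n + 1)
    i = np.arange(n)  # states where both actions are available
    for _ in range(max_iter):
        # cost of processing the batch: the same for every state
        process = K + alpha * ((1 - p) * J[0] + p * J[1])
        # cost of waiting in states 0..n-1
        wait = c * i + alpha * ((1 - p) * J[:-1] + p * J[1:])
        # state n has no choice but to process
        J_new = np.append(np.minimum(process, wait), process)
        if np.max(np.abs(J_new - J)) < tol:
            return J_new
        J = J_new
    return J

J = vi()
```

Since waiting is free in state 0, the fixed point satisfies $J^*(n) = K + J^*(0)$ exactly, which is a cheap correctness check on the converged vector.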
Policy iteration
Algorithm overview
For a stationary policy $\mu$, the system of linear equations
$$J_{\mu}(i)=g(i,\mu(i))+\alpha\sum_{j=0}^{n} p_{ij}(\mu(i))\,J_{\mu}(j),\quad i=0,\ldots,n$$
has a unique solution $J_{\mu}(i),\ i=0,\ldots,n$.
Policy iteration method:
- Policy evaluation: solve the linear equations above for $\mu^{k}$ to obtain $J_{\mu^{k}}(i),\ i=0,\ldots,n$ (solved iteratively below).
- Policy improvement: find an improved policy
$$\mu^{k+1}(i)=\arg\min_{u \in U(i)}\left[g(i,u)+\alpha\sum_{j=0}^{n} p_{ij}(u)\,J_{\mu^{k}}(j)\right],\quad \forall i$$
- Termination condition: $J_{\mu^{k+1}}(i)=J_{\mu^{k}}(i)$ for all $i$.
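The evaluation step does not have to be iterative: for a fixed policy the system is linear, so it can be solved exactly as $(I-\alpha P_\mu)J_\mu = g_\mu$. A sketch, assuming states 0..n with state n forced to process (the function name is mine):

```python
import numpy as np

def evaluate_policy(policy, c=1.0, K=5.0, n=10, p=0.5, alpha=0.9):
    """Exact policy evaluation: solve (I - alpha * P_mu) J = g_mu.

    policy[i] in {0, 1} for states 0..n; 1 = process the batch."""
    m = n + 1
    P = np.zeros((m, m))  # transition matrix under the policy
    g = np.zeros(m)       # stage cost under the policy
    for i in range(m):
        if policy[i] == 1 or i == n:  # process: reset to 0 orders, pay K
            g[i] = K
            P[i, 0] = 1 - p
            P[i, 1] = p
        else:                          # wait: keep i orders, pay c * i
            g[i] = c * i
            P[i, i] = 1 - p
            P[i, i + 1] = p
    return np.linalg.solve(np.eye(m) - alpha * P, g)
```

A quick sanity check: under the always-process policy every state has the same cost, K/(1 - α) = 50 with the given parameters.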
Implementation
def value_function(c, K, policy, n, p, alpha):
    # iterative policy evaluation over states 0..n (state n must process)
    value_table = np.zeros(n + 1)
    threshold = 1e-10
    for k in range(1000):
        old_value_table = value_table.copy()
        for i in range(n + 1):
            if policy[i] or i == n:  # process
                value_table[i] = K + alpha * ((1 - p) * old_value_table[0] + p * old_value_table[1])
            else:  # do not process
                value_table[i] = c * i + alpha * ((1 - p) * old_value_table[i] + p * old_value_table[i + 1])
        if np.max(np.fabs(old_value_table - value_table)) < threshold:
            break
    return value_table

def policy_iteration(c, K, n, p, alpha=0.9):
    # policy[i] = 1 means "process the batch" in state i, 0 means "wait"
    policy = np.ones(n + 1)
    for k in range(100):
        # policy evaluation
        value_table = value_function(c, K, policy, n, p, alpha)
        # policy improvement: compare the cost of each action in state i
        new_policy = np.ones(n + 1)
        for i in range(n):
            pro = K + alpha * ((1 - p) * value_table[0] + p * value_table[1])
            unpro = c * i + alpha * ((1 - p) * value_table[i] + p * value_table[i + 1])
            new_policy[i] = 1 if pro < unpro else 0
        # new_policy[n] stays 1: state n is forced to process
        if np.all(policy == new_policy):
            break
        policy = new_policy
    print("the policy of the discounted problem with alpha = " + str(alpha) + " is: ")
    print(policy)
    return policy
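As an end-to-end sanity check, the whole method fits in one compact self-contained sketch (policy iteration with exact linear-system evaluation; names are mine). For this problem the optimal policy is expected to be a threshold rule: wait while the number of unfilled orders is below some level, and process at or above it:

```python
import numpy as np

def solve(c=1.0, K=5.0, n=10, p=0.5, alpha=0.9):
    m = n + 1
    policy = np.ones(m, dtype=int)  # start from "always process"
    for _ in range(100):
        # exact policy evaluation: J = (I - alpha * P)^(-1) g
        P, g = np.zeros((m, m)), np.zeros(m)
        for i in range(m):
            if policy[i] == 1 or i == n:  # process
                g[i] = K
                P[i, 0], P[i, 1] = 1 - p, p
            else:                          # wait
                g[i] = c * i
                P[i, i], P[i, i + 1] = 1 - p, p
        J = np.linalg.solve(np.eye(m) - alpha * P, g)
        # policy improvement
        new_policy = np.ones(m, dtype=int)
        for i in range(n):
            pro = K + alpha * ((1 - p) * J[0] + p * J[1])
            unpro = c * i + alpha * ((1 - p) * J[i] + p * J[i + 1])
            new_policy[i] = 1 if pro < unpro else 0
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, J

policy, J = solve()
```

Since processing in state 0 can never pay (it adds K for nothing), the threshold is at least 1, so `policy[0]` must be 0 while `policy[n]` is forced to 1.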