batch manufacturing problem

Exercise 7.8 from "Dynamic Programming and Optimal Control" (Bertsekas)

  • A manufacturer receives an order for her product at each time period with probability p, and receives no order with probability 1 - p.
  • At any period she may either process all unfilled orders as a single batch or process no orders at all. At most n orders can remain unfilled.
  • The cost per unfilled order at each time period is c > 0, and the setup cost to process the unfilled orders is K > 0.
  • The manufacturer wants to find a processing policy that minimizes the total expected cost with discount factor α < 1 (a short simulation sketch of these dynamics follows this list).
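To make the dynamics concrete, here is a minimal simulation sketch of a threshold policy (process the batch once the queue reaches a given size). The function name and the threshold parameter are illustrative, not part of the original problem:

import numpy as np

def simulate_cost(threshold, c=1, K=5, n=10, p=0.5, alpha=0.9,
                  horizon=500, seed=0):
    # Simulate the discounted cost of a threshold policy: process the
    # batch whenever the number of unfilled orders reaches `threshold`.
    # `simulate_cost` and `threshold` are illustrative names only.
    rng = np.random.default_rng(seed)
    i, total, discount = 0, 0.0, 1.0
    for _ in range(horizon):
        if i >= threshold or i == n:   # process the whole batch
            total += discount * K
            i = 0
        else:                          # carry the unfilled orders
            total += discount * c * i
        i += int(rng.random() < p)     # a new order arrives w.p. p
        discount *= alpha
    return total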

We implement two algorithms: value iteration and policy iteration.

The parameters are

c = 1    K = 5    n = 10    p = 0.5    alpha = 0.9

Value iteration

Algorithm overview

$$J_{k+1}(i)=\min_{u \in U(i)}\Big[g(i,u)+\alpha\sum_{j=1}^{n} p_{ij}(u)\,J_k(j)\Big],\quad \forall i$$

starting from any initial condition, for instance $J_0(i)=0$.
It is guaranteed that $\lim_{k \rightarrow \infty} J_k(i)=J^{*}(i)$.
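With state i = number of unfilled orders (0 ≤ i ≤ n), the recursion specializes to the two available actions, which is exactly what the code below computes as pro and unpro:

$$J_{k+1}(i)=\min\Big\{K+\alpha(1-p)J_k(0)+\alpha p\,J_k(1),\ \ c\,i+\alpha(1-p)J_k(i)+\alpha p\,J_k(i+1)\Big\},\quad 0\le i<n,$$

with the batch forced at the full queue: $J_{k+1}(n)=K+\alpha(1-p)J_k(0)+\alpha p\,J_k(1)$.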

Implementation
import numpy as np

def value_iteration(c, K, n, p, alpha=0.9):
    # state i = number of unfilled orders, i = 0, ..., n
    Jk = np.zeros(n+1)
    Jkp1 = np.zeros(n+1)
    threshold = 1e-10
    k = 0
    while k < 1000:
        k += 1
        for i in range(n):
            # process the batch: pay K, the queue empties, then a new
            # order arrives with probability p
            pro = K + alpha*(1-p)*Jk[0] + alpha*p*Jk[1]
            # wait: pay the holding cost c*i, the queue stays or grows
            unpro = c*i + alpha*(1-p)*Jk[i] + alpha*p*Jk[i+1]
            Jkp1[i] = min(pro, unpro)
        # at the full queue (state n) processing is the only choice
        Jkp1[n] = K + alpha*(1-p)*Jk[0] + alpha*p*Jk[1]
        # check convergence over all states, not just state n
        if np.max(np.fabs(Jkp1 - Jk)) < threshold:
            break
        Jk = Jkp1.copy()
    print("the result of value_iteration is: ")
    print(Jkp1)
    return Jkp1

Policy iteration

Algorithm overview

For a fixed stationary policy $\mu$, the linear system

$$J_{\mu}(i)=g(i,\mu(i))+\alpha\sum_{j=1}^{n} p_{ij}(\mu(i))\,J_{\mu}(j),\quad i=1,\ldots,n$$

has a unique solution $J_{\mu}(i)$, $i=1,\ldots,n$.
Policy iteration method:

  • Policy evaluation: solve the linear system above for the current policy $\mu^{k}$ to obtain $J_{\mu^{k}}(i)$, $i=1,\ldots,n$ (the implementation below solves it iteratively; an exact alternative is sketched after this list).

  • Policy improvement: find an improved policy
    $$\mu^{k+1}(i)=\arg\min_{u \in U(i)}\Big[g(i,u)+\alpha\sum_{j=1}^{n} p_{ij}(u)\,J_{\mu^{k}}(j)\Big],\quad \forall i$$

  • Termination condition: $J_{\mu^{k+1}}(i)=J_{\mu^{k}}(i)$ for all $i$.
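Since the evaluation step is just a linear system, it can also be solved exactly with one call to np.linalg.solve instead of iterating. A minimal sketch, assuming states 0, ..., n and that the full queue is always processed; the helper name is illustrative, not from the original post:

import numpy as np

def evaluate_policy_exact(c, K, policy, n, p, alpha):
    # Exact policy evaluation: build g_mu and P_mu over states
    # 0, ..., n and solve (I - alpha * P_mu) J = g_mu directly.
    m = n + 1
    P = np.zeros((m, m))
    g = np.zeros(m)
    for i in range(m):
        if policy[i]:        # process: pay K, the queue empties
            g[i] = K
            P[i, 0] = 1 - p  # no new order arrives
            P[i, 1] = p      # one new order arrives
        else:                # wait: pay c*i (requires policy[n] == 1)
            g[i] = c * i
            P[i, i] = 1 - p
            P[i, i + 1] = p
    return np.linalg.solve(np.eye(m) - alpha * P, g)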

Implementation
def value_function(c, K, policy, n, p, alpha):
    # policy evaluation: iterate J <- g_mu + alpha * P_mu * J
    # over the n+1 states 0, ..., n (consistent with value_iteration)
    value_table = np.zeros(n+1)
    threshold = 1e-10
    k = 0
    while k < 1000:
        old_values = value_table.copy()
        for i in range(n+1):
            if policy[i]:  # process the batch
                value_table[i] = K + alpha*(1-p)*old_values[0] + alpha*p*old_values[1]
            else:  # leave the orders unfilled; policy[n] is always 1,
                   # so the index i+1 stays within bounds here
                value_table[i] = c*i + alpha*(1-p)*old_values[i] + alpha*p*old_values[i+1]
        if np.sum(np.fabs(old_values - value_table)) <= threshold:
            break
        k += 1
    return value_table


def policy_iteration(c, K, n, p, alpha=0.9):
    # start from the policy that always processes (action 1 everywhere)
    policy = np.ones(n+1)
    new_policy = np.ones(n+1)
    iter_num = 100  # maximum number of iterations
    k = 0
    while k < iter_num:
        k += 1
        # policy evaluation
        value_table = value_function(c, K, policy, n, p, alpha)

        # policy improvement: compare the two actions at each state i
        for i in range(n):
            pro = K + alpha*((1-p)*value_table[0] + p*value_table[1])
            unpro = c*i + alpha*((1-p)*value_table[i] + p*value_table[i+1])
            if pro < unpro:
                new_policy[i] = 1  # process
            else:
                new_policy[i] = 0  # wait
        new_policy[n] = 1  # the full queue must always be processed
        if np.all(policy == new_policy):
            break
        policy = new_policy.copy()

    print("the policy of the discounted problem with alpha = " + str(alpha) + " is: ")
    print(policy)

    return policy
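A minimal driver, assuming both functions above live in the same file, runs the two algorithms with the parameters stated at the top of the post:

if __name__ == "__main__":
    c, K, n, p, alpha = 1, 5, 10, 0.5, 0.9
    value_iteration(c, K, n, p, alpha)
    policy_iteration(c, K, n, p, alpha)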