【DP】The Dynamic Programming Algorithm

Basic Model

There are two principal features:

  1. an underlying discrete time dynamic system
  2. a cost function that is additive over time

The system has the form
$$x_{k+1}=f_k(x_k, u_k, w_k)$$
where

  • $x_k$ is the state of the system and summarizes past information that is relevant for future optimization.
  • $u_k$ is the control or decision variable to be selected at time $k$.
  • $w_k$ is a random parameter (disturbance or noise depending on the context).
  • $N$ is the horizon or number of times control is applied.
  • $f_k$ is a function that describes the system and in particular the mechanism by which the state is updated.

The cost function, denoted $g_k(x_k, u_k, w_k)$, is additive and accumulates over time:
$$g_N(x_N)+\sum_{k=0}^{N-1}g_k(x_k, u_k, w_k)$$
where $g_N(x_N)$ is a terminal cost incurred at the end of the process. Because $w_k$ is a random term, we formulate the problem as an optimization of the expected cost
$$\mathbb{E}\bigg\{ g_N(x_N)+\sum_{k=0}^{N-1} g_k(x_k, u_k, w_k)\bigg\}$$
where the expectation is with respect to the joint distribution of the random variables involved. Each control $u_k$ is selected with some knowledge of the current state $x_k$.
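The pieces of the basic model can be sketched in code. The dynamics $f_k$, per-stage cost $g_k$, and demand distribution below are illustrative assumptions (an inventory-style system), not taken from the text; the loop shows how the additive cost accumulates along one sample trajectory.

```python
import random

# Assumed, illustrative choices: f_k(x,u,w) = x + u - w (inventory-style
# dynamics), g_k(x,u,w) = u + |x + u - w| (ordering plus holding/shortage cost).
N = 4                       # horizon: control is applied N times
rng = random.Random(0)

def f(k, x, u, w):          # system function: how the state is updated
    return x + u - w

def g(k, x, u, w):          # per-stage cost, additive over time
    return u + abs(x + u - w)

def g_terminal(x):          # terminal cost g_N(x_N)
    return abs(x)

x, cost = 2, 0.0            # initial state x_0
for k in range(N):
    u = 1                           # some control u_k (held fixed here)
    w = rng.choice([0, 1, 2])       # random disturbance w_k
    cost += g(k, x, u, w)           # accumulate g_k(x_k, u_k, w_k)
    x = f(k, x, u, w)               # x_{k+1} = f_k(x_k, u_k, w_k)
cost += g_terminal(x)               # add g_N(x_N) at the end of the process
```

Averaging `cost` over many such trajectories would estimate the expected cost in the formula above.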

Open-loop & Closed-loop

In open-loop minimization we select all orders $u_0, u_1, \dots, u_{N-1}$ at once at time 0, without waiting to see the subsequent demand levels.

In closed-loop minimization we postpone placing the order $u_k$ until the last possible moment (time $k$), when the current stock $x_k$ will be known.

In particular, in closed-loop inventory optimization we are not interested in finding optimal numerical values of the orders but rather we want to find an optimal rule for selecting at each period $k$ an order $u_k$ for each possible value of stock $x_k$ that can conceivably occur.
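The open-loop/closed-loop distinction can be made concrete with a Monte Carlo sketch. The inventory dynamics, costs, and order-up-to rule below are assumed for illustration; the point is that a feedback rule, which sees the realized stock $x_k$, can only do as well or better than orders fixed at time 0.

```python
import random

# Toy inventory sketch (assumed dynamics and costs, not from the text):
# stock evolves as x_{k+1} = x_k + u_k - w_k, demand w_k uniform on {0,1,2}.
N = 3

def stage_cost(x, u, w):
    return u + abs(x + u - w)          # ordering plus holding/shortage cost

def run(x0, choose, rng):
    """Total cost when u_k = choose(k, x_k); terminal cost taken as 0."""
    x, total = x0, 0.0
    for k in range(N):
        u = choose(k, x)
        w = rng.choice([0, 1, 2])
        total += stage_cost(x, u, w)
        x = x + u - w
    return total

def expected(choose, x0=0, trials=20000, seed=1):
    rng = random.Random(seed)
    return sum(run(x0, choose, rng) for _ in range(trials)) / trials

open_loop   = expected(lambda k, x: 1)               # orders fixed at time 0
closed_loop = expected(lambda k, x: max(0, 1 - x))   # order-up-to-1 feedback rule
```

Here `closed_loop` comes out below `open_loop`: the feedback rule reacts to the observed stock, which is exactly the advantage closed-loop minimization buys.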

State transition

$p_{ij}(u, k)$ is the probability at time $k$ that the next state will be $j$, given that the current state is $i$ and the control selected is $u$, i.e.
$$p_{ij}(u, k)=\mathbb{P}(x_{k+1}=j\mid x_k=i,\ u_k=u)$$

We consider the class of policies (control laws) that consist of a sequence of functions
$$\pi=\{\mu_0, \dots, \mu_{N-1}\}$$
where $\mu_k$ maps states $x_k$ into controls $u_k=\mu_k(x_k)$ and is such that $\mu_k(x_k)\in U_k(x_k)$ for all $x_k\in S_k$. Such policies will be called admissible.

Given an initial state $x_0$ and an admissible policy $\pi=\{\mu_0, \dots, \mu_{N-1}\}$, the states $x_k$ and disturbances $w_k$ are random variables with distributions defined through the system equation
$$x_{k+1}=f_k(x_k, \mu_k(x_k), w_k)$$
Thus, for given functions $g_k$, $k=0,1,\dots,N$, the expected cost of $\pi$ starting at $x_0$ is
$$J_\pi(x_0)=\mathbb{E} \bigg\{ g_N(x_N)+\sum_{k=0}^{N-1}g_k(x_k, \mu_k(x_k), w_k) \bigg\}$$
An optimal policy $\pi^*$ is one that minimizes this cost over the set $\Pi$ of admissible policies:
$$J_{\pi^*}(x_0)=\min_{\pi\in\Pi} J_\pi(x_0)$$
An interesting aspect of the basic problem and of dynamic programming is that it is typically possible to find a policy $\pi^*$ that is simultaneously optimal for all initial states.
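The backward recursion that produces such a policy can be sketched on a tiny finite problem. The state/control sets, transition probabilities $p_{ij}(u,k)$, and costs below are made-up numbers; the recursion itself is the standard DP algorithm, computing the cost-to-go $J_k(i)$ and the rule $\mu_k(i)$ for every state at once.

```python
# Backward DP sketch on a tiny finite problem (all numbers are illustrative).
# States {0,1}, controls {0,1}, horizon N = 2;
# p[u][i][j] = P(x_{k+1} = j | x_k = i, u_k = u), taken time-invariant here.
N = 2
states, controls = [0, 1], [0, 1]
p = {
    0: [[0.8, 0.2], [0.3, 0.7]],   # transition matrix under u = 0
    1: [[0.5, 0.5], [0.9, 0.1]],   # transition matrix under u = 1
}

def g(k, i, u):                     # expected per-stage cost (assumed values)
    return i + 2 * u

def g_N(i):                         # terminal cost
    return 3 * i

# DP recursion: J_k(i) = min_u [ g_k(i,u) + sum_j p_ij(u,k) * J_{k+1}(j) ]
J = {i: g_N(i) for i in states}     # start from J_N = g_N
policy = []                         # policy[k][i] will hold mu_k(i)
for k in reversed(range(N)):
    J_new, mu = {}, {}
    for i in states:
        q = {u: g(k, i, u) + sum(p[u][i][j] * J[j] for j in states)
             for u in controls}
        mu[i] = min(q, key=q.get)   # optimal control at state i, time k
        J_new[i] = q[mu[i]]
    J, policy = J_new, [mu] + policy
# J[i] is now the optimal cost J*(x_0 = i) -- simultaneously for all
# initial states, as noted above; policy is the admissible policy pi*.
```

Running the numbers by hand: $J_2 = (0, 3)$, then $J_1 = (0.6, 3.1)$, then $J_0 = (1.10, 3.35)$, with $u=0$ optimal everywhere for these particular costs.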

Reference

Dynamic Programming and Optimal Control
