Basic Model
There are two principal features:
- an underlying discrete time dynamic system
- a cost function that is additive over time
The system has the form
$$x_{k+1}=f_k(x_k, u_k, w_k)$$
where
- $x_k$ is the state of the system and summarizes past information that is relevant for future optimization.
- $u_k$ is the control or decision variable to be selected at time $k$.
- $w_k$ is a random parameter (disturbance or noise, depending on the context).
- $N$ is the horizon, or number of times control is applied.
- $f_k$ is a function that describes the system and, in particular, the mechanism by which the state is updated.
The cost function is additive: the cost $g_k(x_k, u_k, w_k)$ incurred at time $k$ accumulates over time:
$$g_N(x_N)+\sum_{k=0}^{N-1}g_k(x_k, u_k, w_k)$$
where $g_N(x_N)$ is a terminal cost incurred at the end of the process. Because $w_k$ is a random term, we formulate the problem as an optimization of the expected cost
$$\mathbb{E}\bigg\{ g_N(x_N)+\sum_{k=0}^{N-1} g_k(x_k, u_k, w_k)\bigg\}$$
where the expectation is taken with respect to the joint distribution of the random variables involved. Each control $u_k$ is selected with some knowledge of the current state $x_k$.
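As a concrete illustration of the model, the sketch below simulates a hypothetical inventory system: the state $x_k$ is the stock level, the control $u_k$ is the order quantity, and the disturbance $w_k$ is a random demand. All cost coefficients and the demand distribution are made up for this example.

```python
import random

# A hypothetical inventory instance of the basic model:
# state x_k = stock, control u_k = order, disturbance w_k = random demand.
N = 5  # horizon

def f(x, u, w):
    """System equation x_{k+1} = f_k(x_k, u_k, w_k): stock evolves as x + u - w."""
    return x + u - w

def g(x, u, w):
    """Stage cost g_k: ordering cost plus a holding/shortage penalty."""
    return u + abs(x + u - w)

def g_N(x):
    """Terminal cost g_N(x_N): penalize leftover or missing stock."""
    return abs(x)

def simulate(x0, controls, rng):
    """Roll the system forward once, accumulating the additive cost."""
    x, cost = x0, 0.0
    for k in range(N):
        w = rng.randint(0, 2)          # random demand w_k
        cost += g(x, controls[k], w)   # stage cost accumulates
        x = f(x, controls[k], w)       # state transition
    return cost + g_N(x)               # add terminal cost at the end

rng = random.Random(0)
print(simulate(x0=2, controls=[1] * N, rng=rng))
```

Because $w_k$ is random, each call with a fresh random stream yields a different realized cost; the expected cost above is the average over these realizations.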
Open-loop & Closed-loop
In open-loop minimization we select all orders $u_0, u_1, \dots, u_{N-1}$ at once at time 0, without waiting to see the subsequent demand levels.
In closed-loop minimization we postpone placing the order $u_k$ until the last possible moment (time $k$), when the current stock $x_k$ will be known.
In particular, in closed-loop inventory optimization we are not interested in finding optimal numerical values of the orders; rather, we want to find an optimal rule for selecting, at each period $k$, an order $u_k$ for each possible value of the stock $x_k$ that can conceivably occur.
State transition
$p_{ij}(u, k)$ is the probability at time $k$ that the next state will be $j$, given that the current state is $i$ and the control selected is $u$, i.e.
$$p_{ij}(u, k)=\mathbb{P}(x_{k+1}=j\mid x_k=i, u_k=u)$$
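For a finite state space, these transition probabilities can be stored as a table and sampled from directly. A minimal sketch, assuming a hypothetical two-state system with made-up, time-invariant probabilities:

```python
import random

# Hypothetical finite-state example: states {0, 1}, controls {0, 1}.
# P[(i, u)] lists [P(next = 0), P(next = 1)], i.e. the row p_ij(u, k)
# (time-invariant here for simplicity, so k is not an argument).
P = {
    (0, 0): [0.9, 0.1],
    (0, 1): [0.5, 0.5],
    (1, 0): [0.2, 0.8],
    (1, 1): [0.7, 0.3],
}

def step(i, u, rng):
    """Sample the next state j with probability p_ij(u, k)."""
    return rng.choices([0, 1], weights=P[(i, u)])[0]

rng = random.Random(1)
samples = [step(0, 1, rng) for _ in range(10000)]
print(sum(samples) / len(samples))  # empirical frequency of state 1, near p_01 = 0.5
```

This table view and the system-equation view are equivalent: $p_{ij}(u,k)$ is the probability that $f_k(i, u, w_k) = j$ under the distribution of $w_k$.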
We consider the class of policies (control laws) that consist of a sequence of functions
$$\pi=\{\mu_0, \dots, \mu_{N-1}\}$$
where $\mu_k$ maps states $x_k$ into controls $u_k=\mu_k(x_k)$ and is such that $\mu_k(x_k)\in U_k(x_k)$ for all $x_k\in S_k$. Such policies will be called admissible.
Given an initial state $x_0$ and an admissible policy $\pi=\{\mu_0, \dots, \mu_{N-1}\}$, the states $x_k$ and disturbances $w_k$ are random variables with distributions defined through the system equation
$$x_{k+1}=f_k(x_k, \mu_k(x_k), w_k)$$
Thus, for given functions $g_k$, $k=0,1,\dots,N$, the expected cost of $\pi$ starting at $x_0$ is
$$J_\pi(x_0)=\mathbb{E} \bigg\{ g_N(x_N)+\sum_{k=0}^{N-1}g_k(x_k, \mu_k(x_k), w_k) \bigg\}$$
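Since $J_\pi(x_0)$ is an expectation, it can be estimated by Monte Carlo: roll the closed-loop system forward many times under $\pi$ and average the accumulated costs. A sketch for a hypothetical inventory instance with an order-up-to policy (the dynamics, costs, and demand distribution are all assumptions for illustration):

```python
import random

# Monte Carlo estimate of J_pi(x0) for a hypothetical inventory instance:
# x_{k+1} = x_k + mu_k(x_k) - w_k, stage cost = order + |stock mismatch|.
N = 5

def mu(k, x):
    """A simple admissible policy: order up to a target stock level of 2."""
    return max(0, 2 - x)

def J_pi(x0, num_runs=20000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_runs):
        x, cost = x0, 0.0
        for k in range(N):
            u = mu(k, x)                # closed-loop control u_k = mu_k(x_k)
            w = rng.randint(0, 2)       # demand w_k, uniform on {0, 1, 2}
            cost += u + abs(x + u - w)  # stage cost g_k
            x = x + u - w               # system equation
        total += cost + abs(x)          # terminal cost g_N(x_N) = |x_N|
    return total / num_runs             # sample mean approximates the expectation

print(J_pi(x0=0))
```

Note how the policy is evaluated at the realized state inside the loop: this is exactly the closed-loop setting, where the order at time $k$ may depend on the observed stock $x_k$.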
An optimal policy $\pi^*$ is one that minimizes this cost:
$$J_{\pi^*}(x_0)=\min_{\pi\in\Pi} J_\pi(x_0)$$
where $\Pi$ is the set of admissible policies.
An interesting aspect of the basic problem, and of dynamic programming, is that it is typically possible to find a policy $\pi^*$ that is simultaneously optimal for all initial states.
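This can be checked by brute force on a small example: enumerate every admissible policy, compute the optimal cost from each initial state, and verify that a single policy attains all of these minima at once. A sketch on a hypothetical two-state, two-control, horizon-2 deterministic instance (all transition and cost numbers are made up):

```python
import itertools

# Tiny deterministic instance: states {0, 1}, controls {0, 1}, N = 2,
# terminal cost g_N = 0. Control u = 0 moves to state 0, u = 1 to state 1.
states, controls, N = (0, 1), (0, 1), 2
next_state = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}
stage_cost = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): 0.5}

def J(policy, x0):
    """Cost of policy pi = (mu_0, ..., mu_{N-1}) starting from x0."""
    x, cost = x0, 0.0
    for mu in policy:          # each mu_k is a dict mapping state -> control
        u = mu[x]
        cost += stage_cost[(x, u)]
        x = next_state[(x, u)]
    return cost

# Enumerate every admissible policy: each mu_k assigns a control to each state.
maps = [dict(zip(states, c)) for c in itertools.product(controls, repeat=len(states))]
policies = list(itertools.product(maps, repeat=N))

# Optimal cost from each initial state, then the policies that attain
# every one of those optima simultaneously.
opt = {x0: min(J(p, x0) for p in policies) for x0 in states}
simultaneous = [p for p in policies if all(J(p, x0) == opt[x0] for x0 in states)]
print(opt, len(simultaneous) > 0)
```

On this instance the set `simultaneous` is non-empty, i.e. one policy is optimal from both initial states; dynamic programming makes this more than a coincidence for the class of problems above.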
Reference
Bertsekas, D. P., *Dynamic Programming and Optimal Control*, Athena Scientific.