Contents
2 The Wasserstein Metric
2.1 Basics
- Consider the Linear Programming (LP) problem:

$$
\begin{aligned}
W_{\mathbf{S}, 1}(\mathbb{P}, \mathbb{Q}) = \min_{\pi}\ & \sum_{i=1}^{m} \sum_{j=1}^{n} \pi(i, j)\, s(i, j) \\
\text{s.t. } & \sum_{i=1}^{m} \pi(i, j) = q_{j}, \quad j \in \llbracket n \rrbracket \\
& \sum_{j=1}^{n} \pi(i, j) = p_{i}, \quad i \in \llbracket m \rrbracket \\
& \pi(i, j) \ge 0, \quad \forall i, j
\end{aligned}
$$

  The optimal objective value is the order-1 Wasserstein distance between the distributions $\mathbb{P}$ and $\mathbb{Q}$.
- Similarly, by defining the cost matrix $\mathbf{S}^{t} = \left((s(i, j))^{t}\right)$, we obtain the order-$t$ Wasserstein distance

$$
W_{\mathbf{S}, t}(\mathbb{P}, \mathbb{Q}) = \left(W_{\mathbf{S}^{t}, 1}(\mathbb{P}, \mathbb{Q})\right)^{1/t}
$$

- The above LP formulation is equivalent to the well-known transportation problem (Bertsimas and Tsitsiklis, 1997).
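The transportation LP above can be solved directly with an off-the-shelf LP solver. A minimal sketch using `scipy.optimize.linprog` (the helper name `wasserstein_lp` and the three-point example are my own, not from the text):

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_lp(p, q, S):
    """Order-1 Wasserstein distance between discrete distributions p (m,)
    and q (n,) under cost matrix S (m, n), via the transportation LP."""
    m, n = S.shape
    c = S.reshape(-1)                      # objective: sum_ij pi(i,j) s(i,j)
    # Equality constraints: row sums of pi equal p, column sums equal q.
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0   # sum_j pi(i,j) = p_i
    for j in range(n):
        A_eq[m + j, j::n] = 1.0            # sum_i pi(i,j) = q_j
    b_eq = np.concatenate([p, q])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Example: supports {0, 1, 2}, cost s(i, j) = |i - j|.
p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.5])
S = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
print(wasserstein_lp(p, q, S))  # 1.0: each half-unit of mass moves one step
```

For the order-$t$ distance, the same routine applies with `S**t` as the cost matrix, followed by a $1/t$ power of the result.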
2.2 A Distance Metric
- In this section we establish that the Wasserstein distance $W_{\mathbf{S}, t}(\mathbb{P}, \mathbb{Q})$ is a distance metric, assuming that the underlying cost $s(i, j)$ is itself a proper distance metric.
- $W_{\mathbf{S}, 1}(\mathbb{P}, \mathbb{Q})$, viewed as a function of the vectors $\mathbf{p}$ and $\mathbf{q}$ corresponding to $\mathbb{P}$ and $\mathbb{Q}$, is a convex function.
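The convexity claim can be checked numerically at any particular point: the distance at a mixture of two $(\mathbf{p}, \mathbf{q})$ pairs is at most the mixture of the distances. A small sketch on a one-dimensional support with cost $|x - y|$ (the specific distributions are my own illustration, not from the text):

```python
import numpy as np
from scipy.stats import wasserstein_distance

support = np.arange(4.0)  # points {0, 1, 2, 3}, cost |x - y|
p1, q1 = np.array([0.7, 0.1, 0.1, 0.1]), np.array([0.1, 0.1, 0.1, 0.7])
p2, q2 = np.array([0.25, 0.25, 0.25, 0.25]), np.array([0.1, 0.4, 0.4, 0.1])
lam = 0.3

# W at the convex combination of the two (p, q) pairs ...
lhs = wasserstein_distance(support, support,
                           lam * p1 + (1 - lam) * p2,
                           lam * q1 + (1 - lam) * q2)
# ... versus the convex combination of the two W values.
rhs = (lam * wasserstein_distance(support, support, p1, q1)
       + (1 - lam) * wasserstein_distance(support, support, p2, q2))
print(lhs <= rhs + 1e-12)  # True: consistent with convexity
```

This is a single spot check, not a proof; convexity follows because $W_{\mathbf{S},1}$ is the optimal value of an LP whose right-hand side is linear in $(\mathbf{p}, \mathbf{q})$.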
2.3 The Dual Problem
- The dual of the LP in Section 2.1:

$$
\begin{aligned}
W_{\mathbf{S}, 1}(\mathbb{P}, \mathbb{Q}) = \max_{\mathbf{f}, \mathbf{g}}\ & \sum_{i=1}^{m} g_{i} p_{i} + \sum_{j=1}^{n} f_{j} q_{j} \\
\text{s.t. } & f_{j} + g_{i} \le s(i, j), \quad i \in \llbracket m \rrbracket,\ j \in \llbracket n \rrbracket
\end{aligned}
$$

- Interpretation
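The dual is itself a small LP and can be solved the same way; by strong LP duality its optimum matches the primal. A sketch, again with `scipy.optimize.linprog` (the helper name and example are mine):

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_dual(p, q, S):
    """Solve max_{f,g} <g,p> + <f,q> s.t. g_i + f_j <= s(i,j),
    with variables ordered as (g_1..g_m, f_1..f_n), all free."""
    m, n = S.shape
    c = -np.concatenate([p, q])            # linprog minimizes, so negate
    A_ub = np.zeros((m * n, m + n))
    for i in range(m):
        for j in range(n):
            A_ub[i * n + j, i] = 1.0       # coefficient of g_i
            A_ub[i * n + j, m + j] = 1.0   # coefficient of f_j
    b_ub = S.reshape(-1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(None, None), method="highs")
    return -res.fun

# Same example as in Section 2.1: supports {0, 1, 2}, cost |i - j|.
p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.5])
S = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
print(wasserstein_dual(p, q, S))  # 1.0, matching the primal optimum
```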
2.3.1 Arbitrary Measures and Kantorovich Duality
- Primal:

$$
W_{s, 1}(\mathbb{P}, \mathbb{Q}) = \min_{\pi} \int_{\mathcal{Z}_{1} \times \mathcal{Z}_{2}} s(\mathbf{z}_{1}, \mathbf{z}_{2})\, \mathrm{d}\pi(\mathbf{z}_{1}, \mathbf{z}_{2})
$$

$$
W_{s, t}(\mathbb{P}, \mathbb{Q}) = \left(W_{s^{t}, 1}(\mathbb{P}, \mathbb{Q})\right)^{1/t}, \qquad s^{t}(\mathbf{z}_{1}, \mathbf{z}_{2}) = \left(s(\mathbf{z}_{1}, \mathbf{z}_{2})\right)^{t}
$$

- Dual:

$$
\begin{aligned}
W_{s, 1}(\mathbb{P}, \mathbb{Q}) = \sup_{f, g}\ & \int_{\mathcal{Z}_{1}} g(\mathbf{z}_{1})\, \mathrm{d}\mathbb{P}(\mathbf{z}_{1}) + \int_{\mathcal{Z}_{2}} f(\mathbf{z}_{2})\, \mathrm{d}\mathbb{Q}(\mathbf{z}_{2}) \\
\text{s.t. } & f(\mathbf{z}_{2}) + g(\mathbf{z}_{1}) \le s(\mathbf{z}_{1}, \mathbf{z}_{2}), \quad \mathbf{z}_{1} \in \mathcal{Z}_{1},\ \mathbf{z}_{2} \in \mathcal{Z}_{2}
\end{aligned}
$$
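In the one-dimensional case with cost $s(\mathbf{z}_1, \mathbf{z}_2) = |z_1 - z_2|$, the order-1 distance between arbitrary measures can be estimated directly from samples; `scipy.stats.wasserstein_distance` implements this special case. A sketch (the Gaussian example is my own illustration, not from the text):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=5000)   # samples from P = N(0, 1)
y = rng.normal(2.0, 1.0, size=5000)   # samples from Q = N(2, 1)

# For equal-variance Gaussians, the optimal transport map is a translation,
# so the order-1 distance is the gap between the means, here 2.
print(wasserstein_distance(x, y))  # ≈ 2, up to sampling error
```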
2.4 Some Special Cases
2.5 The Transport Cost Function
- We discuss a number of scenarios for what may be known about the data, and the appropriate transport cost function implied in each case:
  - sparse
  - dense
  - group sparsity
2.5.1 Transport Cost Function via Metric Learning
2.6 Robustness of the Wasserstein Ambiguity Set
- Theorem 2.6.1. Suppose we are given two probability distributions $\mathbb{P}$ and $\mathbb{P}_{out}$, and the mixture distribution $\mathbb{P}_{mix}$ is a convex combination of the two: $\mathbb{P}_{mix} = q\mathbb{P}_{out} + (1 - q)\mathbb{P}$. Then, for any cost function $s$,

$$
\frac{W_{s, 1}(\mathbb{P}_{out}, \mathbb{P}_{mix})}{W_{s, 1}(\mathbb{P}, \mathbb{P}_{mix})} = \frac{1 - q}{q}
$$

- We claim that when $q$ is small, if the Wasserstein ball radius $\epsilon$ is chosen judiciously, the true distribution $\mathbb{P}$ will be included in the $\epsilon$-Wasserstein ball $\Omega$ while the outlying distribution $\mathbb{P}_{out}$ will be excluded.
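The ratio in Theorem 2.6.1 is easy to verify numerically in a toy case. A sketch with point masses and cost $|x - y|$ (the specific supports and $q$ are my own choices):

```python
import numpy as np
from scipy.stats import wasserstein_distance

q = 0.2
# P = delta at 0, P_out = delta at 10; P_mix = q*P_out + (1-q)*P on {0, 10}.
support = np.array([0.0, 10.0])
P     = np.array([1.0, 0.0])
P_out = np.array([0.0, 1.0])
P_mix = q * P_out + (1 - q) * P

num = wasserstein_distance(support, support, P_out, P_mix)  # move (1-q) mass 10 units
den = wasserstein_distance(support, support, P, P_mix)      # move q mass 10 units
print(num / den, (1 - q) / q)  # both 4.0
```

With $q = 0.2$ the outlier sits $(1-q)/q = 4$ times farther from the mixture than the true distribution does, which is what lets a well-chosen radius $\epsilon$ separate the two.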
2.7 Setting the Radius of the Wasserstein Ball
- In the next two subsections we discuss two practical approaches to selecting the radius of the Wasserstein ball.