Wasserstein distance in Optimal transport

1. Notations

Consider two probability measures $\mu$ and $\nu$ defined on measure spaces $X$ and $Y$. In most applications $X$ and $Y$ are subsets of $\mathbb{R}^d$, and $\mu$ and $\nu$ have density functions, which we denote by $I_0$ and $I_1$: $d\mu(x)=I_0(x)dx$ and $d\nu(x)=I_1(x)dx$ (originally representing the height of a pile of soil/sand and the depth of an excavation).

2. Monge’s formulation

Monge’s optimal transportation problem is to find a measurable map $f: X \rightarrow Y$ that pushes $\mu$ onto $\nu$ and minimizes the objective

$$M(\mu,\nu)=\inf_{f\in MP}\int_X c(x,f(x))\,d\mu(x)$$

where $c: X\times Y\rightarrow \mathbb{R}^+$ is the cost function and $MP=\{f:X\rightarrow Y \mid f_\#\mu=\nu\}$ is the set of maps that push $\mu$ forward onto $\nu$. The pushforward condition $f_\#\mu=\nu$ is characterized by $\int_{f^{-1}(A)} d\mu(x)=\int_{A} d\nu(y)$ for any measurable $A\subset Y$.


​ Simply put, the Monge formulation of the problem seeks the best pushforward map that rearranges measure µ into measure ν while minimizing a specific cost function.
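In one dimension, with an atomless $\mu$ and a convex cost, the optimal Monge map is simply the monotone rearrangement $f = F_\nu^{-1}\circ F_\mu$. A minimal sketch of this map estimated from samples (the distributions, names, and sample sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two 1-D samples standing in for mu and nu (illustrative parameters).
x = rng.normal(0.0, 1.0, 5000)   # samples from mu ~ N(0, 1)
y = rng.normal(3.0, 0.5, 5000)   # samples from nu ~ N(3, 0.25)

def monge_map_1d(x, y, t):
    """Monotone rearrangement f = F_nu^{-1} o F_mu, estimated from
    the empirical CDF of mu and the empirical quantiles of nu."""
    xs = np.sort(x)
    u = np.searchsorted(xs, t, side="right") / len(xs)  # F_mu(t)
    return np.quantile(y, np.clip(u, 0.0, 1.0))         # F_nu^{-1}(u)

pushed = monge_map_1d(x, y, x)      # f#mu: should be distributed like nu
print(pushed.mean(), pushed.std())  # close to 3.0 and 0.5
```

Mapping the $\mu$-samples through $f$ produces points distributed approximately like $\nu$, which is exactly the pushforward condition $f_\#\mu=\nu$.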

Drawbacks:

  1. The objective is nonlinear with respect to $f$.
  2. For certain measures, Monge’s formulation of the optimal transport problem is ill-posed, in the sense that no transport map rearranges one measure into the other. For instance, consider the case where $\mu$ is a Dirac mass while $\nu$ is not.
3. Kantorovich’s formulation

Kantorovich’s formulation alleviates this problem by finding an optimal transport plan as opposed to a transport map. Kantorovich formulated the transportation problem by optimizing over transport plans, where a transport plan is a probability measure $\gamma\in P(X \times Y)$ with marginals $\mu$ and $\nu$. The quantity $\gamma(A,B)$ tells us how much ‘mass’ in set $A$ is being moved to set $B$. Let $\Gamma(\mu,\nu)$ be the set of all such plans. Kantorovich’s formulation can then be written as

$$K(\mu, \nu)=\min_{\gamma \in \Gamma(\mu, \nu)} \int_{X \times Y} c(x, y)\, d\gamma(x, y)$$
Note that unlike the Monge problem, in Kantorovich’s formulation the objective function and the constraints are linear with respect to γ ( x , y ) \gamma (x,y) γ(x,y). Moreover, Kantorovich’s formulation is in the form of a convex optimization problem.


The Kantorovich problem is especially interesting in a discrete setting, that is, for probability measures of the form $\mu=\sum_{i=1}^{M} p_{i} \delta_{x_{i}}$ and $\nu=\sum_{j=1}^{N} q_{j} \delta_{y_{j}}$, where $\delta_{x_i}$ is a Dirac measure centered at $x_i$. The Kantorovich problem can then be written as

$$\begin{aligned} K(\mu, \nu)= & \min_{\gamma} \sum_{i} \sum_{j} c(x_{i}, y_{j})\, \gamma_{ij} \\ \text{s.t. } & \sum_{j} \gamma_{ij}=p_{i}, \quad \sum_{i} \gamma_{ij}=q_{j} \\ & \gamma_{ij} \geq 0, \quad i=1, \ldots, M,\ j=1, \ldots, N \end{aligned}$$

where $\gamma_{ij}$ identifies how much of the mass $p_i$ at $x_i$ needs to be moved to $y_j$.
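Since the discrete Kantorovich problem is a linear program, it can be handed to a generic LP solver directly. A small sketch with `scipy.optimize.linprog` (the supports, masses, and cost below are made-up illustration values):

```python
import numpy as np
from scipy.optimize import linprog

# A tiny instance of the discrete Kantorovich problem.
x = np.array([0.0, 1.0, 2.0])        # support of mu
y = np.array([0.5, 2.5])             # support of nu
p = np.array([0.4, 0.4, 0.2])        # masses p_i
q = np.array([0.7, 0.3])             # masses q_j

M, N = len(x), len(y)
C = np.abs(x[:, None] - y[None, :])  # cost c(x_i, y_j) = |x_i - y_j|

# Flatten gamma_{ij} into a vector; both marginal constraints are linear.
A_eq = np.zeros((M + N, M * N))
for i in range(M):
    A_eq[i, i * N:(i + 1) * N] = 1.0   # sum_j gamma_ij = p_i
for j in range(N):
    A_eq[M + j, j::N] = 1.0            # sum_i gamma_ij = q_j
b_eq = np.concatenate([p, q])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
gamma = res.x.reshape(M, N)            # the optimal transport plan
print("K(mu, nu) =", res.fun)
```

The recovered matrix `gamma` satisfies both marginal constraints, and `res.fun` is $K(\mu,\nu)$, which for this cost also equals the 1-Wasserstein distance between the two discrete measures.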

4. Wasserstein Distance

Let $\Omega$ be a subset of $\mathbb{R}^d$ on which the measures we consider are defined. In most applications $\Omega$ is the domain where the signal is defined, and is thus bounded. Let $P_p(\Omega)$ be the set of Borel probability measures on $\Omega$ with finite $p$-th moment, that is, the set of probability measures $\mu$ on $\mathbb{R}^d$ such that $\int_{\Omega}|x|^{p}\, d\mu(x)<\infty$. The p-Wasserstein distance (metric) $W_p$, for $p \ge 1$, on $P_p(\Omega)$ is then defined using Kantorovich’s formulation with the cost function $c(x,y)=|x-y|^p$: for $\mu$ and $\nu$ in $P_p(\Omega)$,

$$W_{p}(\mu, \nu)=\left(\inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\Omega \times \Omega}|x-y|^{p}\, d\gamma(x, y)\right)^{\frac{1}{p}}$$

For any $p \ge 1$, $W_p$ is a metric on $P_p(\Omega)$. The metric space $(P_p(\Omega), W_p)$ is referred to as the p-Wasserstein space.

Note that the p-Wasserstein metric can equivalently be defined using the dual Kantorovich problem,
$$W_{p}(\mu, \nu)=\left(\sup_{\phi}\left\{\int_{\Omega} \phi(x)\, d\mu(x)-\int_{\Omega} \phi^{c}(y)\, d\nu(y)\right\}\right)^{\frac{1}{p}}$$

where $\phi^{c}(y)=\inf_{x}\left\{\phi(x)-|x-y|^{p}\right\}$. For the specific case of $p = 1$, the p-Wasserstein metric is also known as the Monge–Rubinstein metric, or the earth mover’s distance.

Advantages:

  1. It naturally measures the distance between a discrete distribution and a continuous one;
  2. It gives not only a measure of distance, but also a plan for how to transform one distribution into the other;
  3. It can continuously morph one distribution into another while preserving the distribution’s own geometric features (geodesics and barycenters).

Disadvantages:

  1. High computational cost; in most cases there is no closed-form solution (except in one dimension and for Gaussian distributions).
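For the Gaussian exception mentioned above, the 2-Wasserstein distance between $N(m_0,\Sigma_0)$ and $N(m_1,\Sigma_1)$ has the closed form $W_2^2 = \|m_0-m_1\|^2 + \mathrm{tr}\big(\Sigma_0+\Sigma_1-2(\Sigma_0^{1/2}\Sigma_1\Sigma_0^{1/2})^{1/2}\big)$. A minimal sketch:

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussian(m0, S0, m1, S1):
    """Closed-form 2-Wasserstein distance between N(m0, S0) and N(m1, S1)."""
    S0_half = sqrtm(S0)
    cross = sqrtm(S0_half @ S1 @ S0_half)       # (S0^{1/2} S1 S0^{1/2})^{1/2}
    d2 = np.sum((m0 - m1) ** 2) + np.trace(S0 + S1 - 2 * np.real(cross))
    return np.sqrt(max(d2, 0.0))                # clamp tiny negative round-off

# 1-D sanity check: W_2(N(0,1), N(3,4)) = sqrt((0-3)^2 + (1-2)^2) = sqrt(10)
m0, S0 = np.array([0.0]), np.array([[1.0]])
m1, S1 = np.array([3.0]), np.array([[4.0]])
print(w2_gaussian(m0, S0, m1, S1))   # ~ 3.1623
```

In one dimension this reduces to $W_2^2 = (m_0-m_1)^2 + (\sigma_0-\sigma_1)^2$, which the sanity check above exercises.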

The p-Wasserstein metric for one-dimensional probability measures is specifically interesting due to its simple and unique characterization.
$$\begin{aligned} W_{p}(\mu, \nu) &=\left(\int_{X}\left|x-F_{\nu}^{-1}\left(F_{\mu}(x)\right)\right|^{p}\, d\mu(x)\right)^{\frac{1}{p}} \\ &=\left(\int_{0}^{1}\left|F_{\mu}^{-1}(z)-F_{\nu}^{-1}(z)\right|^{p}\, dz\right)^{\frac{1}{p}} \end{aligned}$$

where $F_\mu$ and $F_\nu$ are the cumulative distribution functions of $\mu$ and $\nu$.
The closed-form solution of the p-Wasserstein distance in one dimension is an attractive property, as it alleviates the need for optimization. This property was employed in the Sliced Wasserstein distance as defined below.
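For equal-size empirical measures, the quantile formula above reduces to sorting, since the sorted samples are exactly the empirical quantiles. A sketch, cross-checked against SciPy's 1-D implementation (the sample distributions and sizes are illustrative):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 4000)   # samples from mu
b = rng.normal(2.0, 1.0, 4000)   # samples from nu

def wp_1d(a, b, p=1):
    """p-Wasserstein distance between equal-size 1-D samples via the
    quantile formula: sorted samples are the empirical quantiles."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)) ** p) ** (1.0 / p)

print(wp_1d(a, b, p=1))            # ~ 2.0: the mean shift
print(wasserstein_distance(a, b))  # SciPy's W_1 agrees
```

`scipy.stats.wasserstein_distance` implements the $p=1$ case; for equal-size samples both computations reduce to an average over sorted pairs and match up to floating-point error.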

5. Sliced Wasserstein distance
  1. Radon Transform


Simply put, the Radon transform describes how a distribution is represented by its projections at each angle.
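In formulas, the (linear) Radon transform of a density $I$ on $\mathbb{R}^d$ projects along directions $\theta$ on the unit sphere; this is the special case $g(x,\theta)=\langle x,\theta\rangle$ of the generalized transform in Section 6:

$$\mathcal{R} I(t, \theta)=\int_{\mathbb{R}^{d}} I(x)\, \delta\big(t-\langle x, \theta\rangle\big)\, dx, \qquad t \in \mathbb{R},\ \theta \in \mathbb{S}^{d-1}$$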


The Radon transform maps a distribution $f(x,y)$ to $f(\alpha, s)$, where $\alpha$ is the projection angle and $s$ the offset along the projection axis.


  2. Sliced Wasserstein distance

    The idea behind the sliced p-Wasserstein distance is to first obtain a family of one-dimensional representations of a higher-dimensional probability distribution through linear projections (via the Radon transform), and then calculate the distance between the two input distributions as a functional of the p-Wasserstein distances between their one-dimensional representations. The sliced p-Wasserstein distance between $I_\mu$ and $I_\nu$ is then formally defined as:

    $$SW_{p}\left(I_{\mu}, I_{\nu}\right)=\left(\int_{\mathbb{S}^{d-1}} W_{p}^{p}\left(\mathcal{R} I_{\mu}(\cdot, \theta), \mathcal{R} I_{\nu}(\cdot, \theta)\right) d\theta\right)^{\frac{1}{p}}$$

    This is indeed a distance function, as it satisfies positive-definiteness, symmetry, and the triangle inequality.

  3. Maximum sliced p-Wasserstein (max-SW)

    $$\max\text{-}SW_{p}\left(I_{\mu}, I_{\nu}\right)=\max_{\theta \in \mathbb{S}^{d-1}} W_{p}\left(\mathcal{R} I_{\mu}(\cdot, \theta), \mathcal{R} I_{\nu}(\cdot, \theta)\right)$$
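In practice the integral over $\mathbb{S}^{d-1}$ is estimated by Monte Carlo: draw random directions, project both samples, and apply the 1-D closed form per direction. A minimal sketch for empirical measures (the function name and parameters are illustrative):

```python
import numpy as np

def sw_p(X, Y, p=2, n_proj=200, seed=0):
    """Monte Carlo sliced p-Wasserstein between equal-size samples
    X, Y (rows are d-dimensional points)."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(n_proj, X.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # theta on S^{d-1}
    px, py = X @ theta.T, Y @ theta.T    # 1-D slices, shape (n, n_proj)
    px.sort(axis=0)
    py.sort(axis=0)
    wpp = np.mean(np.abs(px - py) ** p, axis=0)  # W_p^p for each direction
    return np.mean(wpp) ** (1.0 / p)             # average over directions

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
Y = rng.normal(size=(2000, 2)) + np.array([5.0, 0.0])  # shifted copy
print(sw_p(X, Y))   # dominated by the shift along the first axis
```

Replacing the mean over directions with a maximum gives a crude estimate of max-SW; in practice max-SW is typically computed by optimizing over $\theta$ rather than enumerating random directions.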

6. Generalized Sliced-Wasserstein Distances

​ The GSW distance is obtained using the same procedure as for the SW distance, except that here, the one-dimensional representations are acquired through nonlinear projections.

  1. Generalized Radon Transform

    $$\mathcal{G} I(t, \theta)=\int_{\mathbb{R}^{d}} I(x)\, \delta(t-g(x, \theta))\, dx$$

    where $g(x, \theta)$ is the defining function of the nonlinear projection, generalizing the linear case $g(x, \theta)=\langle x, \theta\rangle$ of the ordinary Radon transform.

  2. Generalized Sliced-Wasserstein and Maximum Generalized Sliced-Wasserstein Distances

    Following the definition of the SW distance, we define the generalized sliced p-Wasserstein distance using the generalized Radon transform as:
    $$GSW_{p}\left(I_{\mu}, I_{\nu}\right)=\left(\int_{\Omega_{\theta}} W_{p}^{p}\left(\mathcal{G} I_{\mu}(\cdot, \theta), \mathcal{G} I_{\nu}(\cdot, \theta)\right) d\theta\right)^{\frac{1}{p}}$$

    The maximum generalized sliced-Wasserstein distance is defined analogously:

    $$\max\text{-}GSW_{p}\left(I_{\mu}, I_{\nu}\right)=\max_{\theta \in \Omega_{\theta}} W_{p}\left(\mathcal{G} I_{\mu}(\cdot, \theta), \mathcal{G} I_{\nu}(\cdot, \theta)\right)$$
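The same Monte Carlo scheme works for GSW: only the projection changes from $\langle x,\theta\rangle$ to a nonlinear defining function $g(x,\theta)$. A sketch with a hypothetical odd-polynomial $g$ (the choice of $g$, and taking $\Omega_\theta$ to be the unit sphere, are illustrative assumptions):

```python
import numpy as np

def gsw_p(X, Y, g, p=2, n_proj=200, seed=0):
    """Monte Carlo generalized sliced p-Wasserstein: g(X, theta) gives
    the 1-D projections; g(x, theta) = <x, theta> recovers plain SW."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(n_proj, X.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    wpp = []
    for th in theta:
        gx, gy = np.sort(g(X, th)), np.sort(g(Y, th))  # 1-D slices
        wpp.append(np.mean(np.abs(gx - gy) ** p))       # W_p^p per slice
    return np.mean(wpp) ** (1.0 / p)

linear = lambda X, th: X @ th          # ordinary (linear) Radon slice
cubic = lambda X, th: (X @ th) ** 3    # hypothetical odd-polynomial g

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
Y = rng.normal(size=(1000, 2)) + 2.0   # shift both coordinates by 2
print(gsw_p(X, Y, linear), gsw_p(X, Y, cubic))
```

With `linear`, this reproduces the SW estimate of the previous section; any injective choice of $g$ yields a valid distance between the slice distributions.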

