1. Probability Density Function
For background on the Bessel function, see 《贝塞尔函数-Wikipedia》 and 《Modified Bessel Function of the First Kind》.
Summary: the larger the concentration parameter $\kappa \ge 0$, the more the directions concentrate around the mean direction $\bm{\mu}$. In the limits: when $\kappa = 0$, the directions are uniformly distributed; as $\kappa \rightarrow \infty$, the distribution becomes a Dirac delta at $\bm{\mu}$.
2. Relation to Normal Distribution
That is, take a $p$-variate Gaussian with covariance matrix $\kappa^{-1}\bm{I}$ and mean $\bm{\mu}$ (with $\|\bm{\mu}\|=r>0$), and restrict its support to the unit hypersphere ($\|\bm{x}\|=1$): the result is the von Mises-Fisher Distribution (vMF).
Further derivation:
$$\begin{aligned} G_p(\bm{x}; \bm{\mu}, \kappa) &= \left(\sqrt{\frac{\kappa}{2\pi}}\right)^p exp\left(-\kappa \frac{(\bm{x}-\bm{\mu})^\intercal(\bm{x}-\bm{\mu})}{2}\right) \\ &= \left(\sqrt{\frac{\kappa}{2\pi}}\right)^p exp\left(-\kappa \frac{ \bm{x}^\intercal\bm{x} + \bm{\mu}^\intercal\bm{\mu} - 2\bm{\mu}^\intercal\bm{x}}{2} \right) \\ &= \left(\sqrt{\frac{\kappa}{2\pi}}\right)^p exp\left(-\kappa \frac{1 + r^2 - 2\bm{\mu}^\intercal\bm{x}}{2} \right) \\ &= \left(\sqrt{\frac{\kappa}{2\pi}}\right)^p exp\left(\frac{-\kappa(1 + r^2)}{2} + \kappa \bm{\mu}^\intercal\bm{x} \right) \\ &= \left[ \left(\sqrt{\frac{\kappa}{2\pi}}\right)^p exp\left(\frac{-\kappa(1 + r^2)}{2} \right) \right] exp\left(\kappa r \frac{\bm{\mu}^\intercal}{r}\bm{x} \right) \end{aligned}$$

[Note] After imposing $\|\bm{x}\|=1$, the normalization constant of the expression above must be recomputed by integrating over the hypersphere; the result is $f_p(\bm{x}; r^{-1}\bm{\mu}, r\kappa)$.
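As a quick numerical check of this note (an illustrative sketch, not part of the original derivation), the unnormalized log-densities of $N(\bm{\mu}, \kappa^{-1}\bm{I})$ and $vMF(r^{-1}\bm{\mu}, r\kappa)$ should differ only by the constant $-\kappa(1+r^2)/2$ on the unit sphere:

```python
import numpy as np

rng = np.random.default_rng(0)
p, kappa, r = 4, 2.5, 3.0
mu = rng.normal(size=p)
mu = r * mu / np.linalg.norm(mu)          # mean with ||mu|| = r

def log_gauss_unnorm(x):
    # log of the unnormalized N(mu, kappa^{-1} I) density
    return -0.5 * kappa * np.sum((x - mu) ** 2)

def log_vmf_unnorm(x):
    # log of the unnormalized vMF(mu/r, r*kappa) density
    return (r * kappa) * (mu / r) @ x

# On the unit sphere the two log-densities differ by the constant
# -kappa * (1 + r^2) / 2, independent of x.
xs = rng.normal(size=(5, p))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
diffs = [log_gauss_unnorm(x) - log_vmf_unnorm(x) for x in xs]
print(np.allclose(diffs, -kappa * (1 + r**2) / 2))  # True
```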
The figure above visualizes, in the two-dimensional plane, the normal distribution with mean $\bm{\mu}=(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}})$ and covariance matrix $\Sigma = \begin{bmatrix} 1&0 \\ 0&1 \end{bmatrix}$: restricting its support to the unit circle and recomputing the normalization constant yields the von Mises distribution (the two-dimensional case of vMF).
Summary: the von Mises-Fisher Distribution is essentially a normal distribution on the hypersphere; it is isotropic, with covariance matrix $\kappa^{-1}\bm{I}$.
3. Estimating the Parameters of the vMF Distribution
Following Wikipedia, we first write the probability density function of the vMF as:

$$\begin{aligned} f_p(\bm{x}; \bm{\mu}, \kappa) =& C_p(\kappa) exp(\kappa \bm{\mu}^\intercal \bm{x}) \\ C_p(\kappa) =& \frac{\kappa^{\frac{p}{2}-1}}{(2\pi)^{\frac{p}{2}}I_{\frac{p}{2}-1}(\kappa)} \end{aligned}$$

Given samples $\{\bm{x}_1, \bm{x}_2, \cdots, \bm{x}_n\}$ drawn independently from some vMF, we now estimate its parameters $(\kappa, \bm{\mu})$ by maximum likelihood, with a Lagrange multiplier $\lambda$ enforcing $\|\bm{\mu}\|=1$:

$$L(\bm{\mu}; p, \kappa, \lambda) = nlog(C_p(\kappa)) + \kappa\bm{\mu}^\intercal \sum_{i=1}^n\bm{x}_i + \lambda(\|\bm{\mu}\| - 1)$$

Take the partial derivatives:

$$\begin{aligned} \frac{\partial L}{\partial \bm{\mu}} =& \kappa \sum_{i=1}^n\bm{x}_i + \lambda \frac{\bm{\mu}}{\|\bm{\mu}\|} &= 0 \\ \frac{\partial L}{\partial \kappa} =& n \frac{C_p^{'}(\kappa)}{C_p(\kappa)} + \bm{\mu}^\intercal \sum_{i=1}^n\bm{x}_i &= 0 \\ \frac{\partial L}{\partial \lambda} =& \|\bm{\mu}\| - 1 &= 0 \end{aligned}$$

First look at the rather special term $\frac{C_p^{'}(\kappa)}{C_p(\kappa)}$ (the second-to-last step below uses the Bessel identity $I_{\nu}^{'}(\kappa) = I_{\nu+1}(\kappa) + \frac{\nu}{\kappa}I_{\nu}(\kappa)$):

$$\begin{aligned} C_p(\kappa) &= \frac{\kappa^{\frac{p}{2}-1}}{(2\pi)^{\frac{p}{2}} I_{\frac{p}{2}-1}(\kappa)} = (2\pi)^{-\frac{p}{2}}\frac{\kappa^{\frac{p}{2}-1}}{I_{\frac{p}{2}-1}(\kappa)} \\ C_p^{'}(\kappa) &= (2\pi)^{-\frac{p}{2}} \frac{(\frac{p}{2}-1)\kappa^{\frac{p}{2}-2} I_{\frac{p}{2}-1}(\kappa) - \kappa^{\frac{p}{2}-1} I_{\frac{p}{2}-1}^{'}(\kappa)}{I_{\frac{p}{2}-1}^2(\kappa)} \\ \frac{C_p^{'}(\kappa)}{C_p(\kappa)} &= \frac{ (2\pi)^{-\frac{p}{2}} \frac{(\frac{p}{2}-1)\kappa^{\frac{p}{2}-2} I_{\frac{p}{2}-1}(\kappa) - \kappa^{\frac{p}{2}-1} I_{\frac{p}{2}-1}^{'}(\kappa)}{I_{\frac{p}{2}-1}^2(\kappa)} }{ (2\pi)^{-\frac{p}{2}}\frac{\kappa^{\frac{p}{2}-1}}{I_{\frac{p}{2}-1}(\kappa)} } \\ &= \frac{ (\frac{p}{2}-1)\kappa^{\frac{p}{2}-2} I_{\frac{p}{2}-1}(\kappa) - \kappa^{\frac{p}{2}-1} I_{\frac{p}{2}-1}^{'}(\kappa) }{\kappa^{\frac{p}{2}-1} I_{\frac{p}{2}-1}(\kappa)} \\ &= \frac{ (\frac{p}{2}-1)\kappa^{-1} I_{\frac{p}{2}-1}(\kappa) - I_{\frac{p}{2}-1}^{'}(\kappa) }{I_{\frac{p}{2}-1}(\kappa)} \\ &= \frac{ (\frac{p}{2}-1)\kappa^{-1} I_{\frac{p}{2}-1}(\kappa) - [I_{\frac{p}{2}}(\kappa) + (\frac{p}{2}-1)\kappa^{-1} I_{\frac{p}{2}-1}(\kappa)] }{I_{\frac{p}{2}-1}(\kappa)} \\ &= -\frac{I_{\frac{p}{2}}(\kappa)}{I_{\frac{p}{2}-1}(\kappa)} \lt 0 ~~~~~ [\because I_{\alpha}(x) \ge 0] \end{aligned}$$

Write $A_p(\kappa) = -\frac{C_p^{'}(\kappa)}{C_p(\kappa)} = \frac{I_{\frac{p}{2}}(\kappa)}{I_{\frac{p}{2}-1}(\kappa)}$. Now find a way to eliminate $\bm{\mu}$ from the first two equations:

$$\begin{aligned} (\kappa \sum_{i=1}^n\bm{x}_i)^\intercal (\kappa \sum_{i=1}^n\bm{x}_i) &= (-\lambda \bm{\mu})^\intercal (-\lambda \bm{\mu}) \\ \kappa^2(\sum_{i=1}^n\bm{x}_i)^\intercal (\sum_{i=1}^n\bm{x}_i) &= \lambda^2 \\ \left\|\sum_{i=1}^n\bm{x}_i\right\| &= \frac{\left|\lambda\right|}{\kappa} & [\text{sign of}~\lambda~\text{unknown}] \tag{1} \end{aligned}$$

$$\begin{aligned} \kappa \bm{\mu}^\intercal \sum_{i=1}^n\bm{x}_i + \lambda \bm{\mu}^\intercal \bm{\mu} &= \kappa \bm{\mu}^\intercal \sum_{i=1}^n\bm{x}_i + \lambda = 0 \\ n\kappa\frac{C_p^{'}(\kappa)}{C_p(\kappa)} + \kappa \bm{\mu}^\intercal \sum_{i=1}^n\bm{x}_i &= \kappa \bm{\mu}^\intercal \sum_{i=1}^n\bm{x}_i + n\kappa\frac{C_p^{'}(\kappa)}{C_p(\kappa)} = 0 \\ n\kappa\frac{C_p^{'}(\kappa)}{C_p(\kappa)} &= \lambda \lt 0 & [\because \frac{C_p^{'}(\kappa)}{C_p(\kappa)} \lt 0] \tag{2} \end{aligned}$$

Now the absolute value in $(1)$ can be removed:

$$\begin{aligned} \left\|\sum_{i=1}^n\bm{x}_i\right\| &= -\frac{\lambda}{\kappa} = -n \frac{C_p^{'}(\kappa)}{C_p(\kappa)} = nA_p(\kappa) \\ \Longrightarrow A_p(\kappa) &= \frac{1}{n} \left\|\sum_{i=1}^n\bm{x}_i\right\| = \|\bar{\bm{x}}\| \end{aligned}$$

However, solving the equation $A_p(\kappa) = \|\bar{\bm{x}}\|$ for $\kappa$ is rather troublesome; see Wikipedia for some approximate solutions. Pretending $\kappa$ has already been found, we can then solve for $\lambda$:

$$\lambda = -n\kappa A_p(\kappa) = -n\kappa \|\bar{\bm{x}}\|$$

and continue to solve for $\bm{\mu}$:

$$\begin{aligned} & \kappa \sum_{i=1}^n\bm{x}_i + \lambda \frac{\bm{\mu}}{\|\bm{\mu}\|} = 0 \\ \longrightarrow \bm{\mu} =& -\frac{\kappa}{\lambda} \sum_{i=1}^n\bm{x}_i \\ =& \frac{1}{\|\sum_{i=1}^n\bm{x}_i\|} \sum_{i=1}^n\bm{x}_i & [\text{estimate of}~\bm{\mu}] \\ =& \frac{1}{n\|\bar{\bm{x}}\|} \sum_{i=1}^n\bm{x}_i \\ =& \frac{1}{A_p(\kappa)} \frac{1}{n}\sum_{i=1}^n\bm{x}_i \\ \longrightarrow mean = \frac{1}{n}\sum_{i=1}^n\bm{x}_i =& A_p(\kappa)\bm{\mu} \end{aligned}$$

That is, the mean (note: not the mean direction) is $mean = \frac{1}{n}\sum_{i=1}^n\bm{x}_i = A_p(\kappa)\bm{\mu}$.
Summary: the maximum likelihood estimates of the vMF parameters are $\begin{cases} \bm{\mu} = \frac{\bar{\bm{x}}}{\|\bar{\bm{x}}\|} \\ \kappa = A_p^{-1}(\|\bar{\bm{x}}\|) \end{cases}$, and the expected value is $A_p(\kappa)\bm{\mu}$.
4. Entropy and KL Divergence
The entropy of the distribution $vMF(\bm{\mu}, \kappa)$ is:
$$\begin{aligned} \langle -logf_p(\bm{x};\bm{\mu},\kappa) \rangle_{\bm{x}\sim vMF(\bm{\mu},\kappa)} &= -\int_{\text{sphere}} f_p(\bm{x};\bm{\mu},\kappa) log\left( C_p(\kappa)exp(\kappa \bm{\mu}^\intercal \bm{x}) \right) d\bm{x} \\ &= -\int_{\text{sphere}} f_p(\bm{x};\bm{\mu},\kappa) \left(log C_p(\kappa) + \kappa \bm{\mu}^\intercal \bm{x} \right) d\bm{x} \\ &= -log C_p(\kappa) \int_{\text{sphere}} f_p(\bm{x};\bm{\mu},\kappa) d\bm{x} -\kappa \bm{\mu}^\intercal \int_{\text{sphere}} \bm{x} f_p(\bm{x};\bm{\mu},\kappa) d\bm{x} \\ &= -log C_p(\kappa) -\kappa \bm{\mu}^\intercal A_p(\kappa) \bm{\mu} ~~~ \because mean=A_p(\kappa) \bm{\mu} \\ &= -log f_p(A_p(\kappa)\bm{\mu}; \bm{\mu}, \kappa) \\ &= -log C_p(\kappa) -\kappa A_p(\kappa) \end{aligned}$$

The derivation uses $mean = \int_{\text{sphere}} \bm{x} f_p(\bm{x};\bm{\mu},\kappa) d\bm{x} = A_p(\kappa) \bm{\mu}$, a byproduct of the maximum-likelihood discussion of the parameters above.
The KL divergence between the distributions $vMF(\bm{\mu}_0, \kappa_0)$ and $vMF(\bm{\mu}_1, \kappa_1)$ is:
$$\begin{aligned} \left\langle log\frac{ f_p(\bm{x};\bm{\mu}_0,\kappa_0) }{ f_p(\bm{x};\bm{\mu}_1,\kappa_1) } \right\rangle_{\bm{x}\sim vMF(\bm{\mu}_0, \kappa_0)} &= \int_{\text{sphere}} f_p(\bm{x};\bm{\mu}_0,\kappa_0) log\frac{ C_p(\kappa_0)exp(\kappa_0 \bm{\mu}_0^\intercal \bm{x}) }{ C_p(\kappa_1)exp(\kappa_1 \bm{\mu}_1^\intercal \bm{x}) } d\bm{x} \\ &= \int_{\text{sphere}} f_p(\bm{x};\bm{\mu}_0,\kappa_0) (log C_p(\kappa_0) + \kappa_0 \bm{\mu}_0^\intercal \bm{x}) d\bm{x} -\int_{\text{sphere}} f_p(\bm{x};\bm{\mu}_0,\kappa_0) (log C_p(\kappa_1) + \kappa_1 \bm{\mu}_1^\intercal \bm{x}) d\bm{x} \\ &= log f_p(A_p(\kappa_0)\bm{\mu}_0; \bm{\mu}_0, \kappa_0) -log C_p(\kappa_1) - \kappa_1 \bm{\mu}_1^\intercal \int_{\text{sphere}} \bm{x} f_p(\bm{x};\bm{\mu}_0,\kappa_0) d\bm{x} \\ &= log f_p(A_p(\kappa_0)\bm{\mu}_0; \bm{\mu}_0, \kappa_0) -log C_p(\kappa_1) - \kappa_1 \bm{\mu}_1^\intercal A_p(\kappa_0) \bm{\mu}_0 \\ &= log f_p(A_p(\kappa_0)\bm{\mu}_0; \bm{\mu}_0, \kappa_0) -log[C_p(\kappa_1) exp(\kappa_1 \bm{\mu}_1^\intercal A_p(\kappa_0) \bm{\mu}_0)] \\ &= log f_p(A_p(\kappa_0)\bm{\mu}_0; \bm{\mu}_0, \kappa_0) -log f_p(A_p(\kappa_0)\bm{\mu}_0; \bm{\mu}_1, \kappa_1) \\ &= log \frac{ f_p(A_p(\kappa_0)\bm{\mu}_0; \bm{\mu}_0, \kappa_0) }{ f_p(A_p(\kappa_0)\bm{\mu}_0; \bm{\mu}_1, \kappa_1) } \end{aligned}$$

Summary:

$$\begin{cases} \langle -logf_p(\bm{x};\bm{\mu},\kappa) \rangle_{\bm{x}\sim vMF(\bm{\mu},\kappa)} = -log f_p(A_p(\kappa)\bm{\mu}; \bm{\mu}, \kappa) \\ \left\langle log\frac{ f_p(\bm{x};\bm{\mu}_0,\kappa_0) }{ f_p(\bm{x};\bm{\mu}_1,\kappa_1) } \right\rangle_{\bm{x}\sim vMF(\bm{\mu}_0, \kappa_0)} = log \frac{ f_p(A_p(\kappa_0)\bm{\mu}_0; \bm{\mu}_0, \kappa_0) }{ f_p(A_p(\kappa_0)\bm{\mu}_0; \bm{\mu}_1, \kappa_1) } \end{cases}$$
5. Sampling
Sampling directly from the $vMF$ distribution is practically impossible. Let us first see what Wikipedia says:
Roughly speaking, it uses the so-called "radial-tangential decomposition", which decomposes $\bm{x}$ as $\bm{x} = t\bm{\mu} + \sqrt{1-t^2}\bm{v}$, where $\bm{v} \in \mathbb{R}^p$ lies on the $(p-2)$-dimensional unit tangent subsphere that is centered on $\bm{\mu}$ and perpendicular to $\bm{\mu}$. Draw a $t$ from $[-1,1]$ with probability density $f_{radial}(t;\kappa,p)$ and a $\bm{v}$ uniformly from the tangent subsphere, and the sampling of $\bm{x}$ is complete.
This is rather abstract and confusing; fortunately, a figure can be found in Fast Python Sampler of the von Mises Fisher Distribution:
The tangent subsphere is the circle where the green circle in the figure intersects the sphere; the blue vector is $t\bm{\mu}$ and the yellow vector is $\sqrt{1-t^2}\bm{v}$. As the figure shows, $\bm{v}$ is indeed uniformly distributed on the tangent subsphere. So to sample from the $vMF$ distribution, the main tasks are sampling $t$ and sampling uniformly from the corresponding tangent subsphere.
According to Wikipedia, the probability density function of $t$ is:

$$\begin{aligned} f_{radial}(t; \kappa, p) &= \frac{(\kappa/2)^\nu}{\Gamma({\frac{1}{2}})\Gamma(\nu+{\frac{1}{2}})I_{\nu}(\kappa)} (1-t^2)^{\nu-\frac{1}{2}} exp(\kappa t) & \nu = \frac{p}{2}-1 \\ &= \frac{(\kappa/2)^{\frac{p}{2}-1}}{\sqrt{\pi}\Gamma(\frac{p-1}{2})I_{\frac{p}{2}-1}(\kappa)} (1-t^2)^{\frac{p-3}{2}} exp(\kappa t) \end{aligned}$$

This is fairly involved, so the sampling of $t$ is explained in detail separately below; for now, assume samples of $t$ are already available.
However, for a general $\bm{\mu}$, evaluating $\bm{x} = t\bm{\mu} + \sqrt{1-t^2}\bm{v}$ is not easy, because sampling $\bm{v}$ is difficult. But when $\bm{\mu}= \bm{e}_1 =(1, 0, \cdots, 0)$:
- Draw a $t \in [-1,1]$ with probability density $f_{radial}(t; \kappa, p)$;
- Draw a $\bm{v}$ from the $(p-1)$-variate standard normal distribution and normalize it; this gives a uniformly distributed vector on the tangent subsphere;
- $\bm{x} = [t|\sqrt{1-t^2}\bm{v}]$ is the desired sample, where $|$ denotes concatenation. (Note that $\bm{v}$ has one dimension fewer than $\bm{x}$.)
We can then use the Householder Transform - Wikipedia to map the samples $\bm{x}$ to the truly desired $vMF(\bm{\mu}, \kappa)$ distribution. Concretely:
$$\begin{aligned} P_v &= \frac{\bm{e}_1 - \bm{\mu}}{\|\bm{e}_1 - \bm{\mu}\|} \\ \bm{x} &\leftarrow \bm{x} - 2(P_v^\intercal\bm{x})P_v \end{aligned}$$

gives the truly desired samples. This reflection maps $\bm{e}_1$ onto $\bm{\mu}$, so the corresponding $vMF(\bm{e}_1, \kappa)$ becomes $vMF(\bm{\mu}, \kappa)$.
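The reflection step can be sketched as follows (an illustration: the $t$ values below are uniform placeholders just to exercise the geometry, not true $f_{radial}$ samples). The reflection is orthogonal and involutive, so it preserves norms and swaps $\bm{e}_1$ and $\bm{\mu}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def householder_to_mu(samples, mu):
    """Reflect samples built around e1 so that e1 maps onto mu.

    H = I - 2 P_v P_v^T with P_v = (e1 - mu)/||e1 - mu|| is orthogonal,
    so it preserves norms, and it sends e1 to mu.
    """
    p = len(mu)
    e1 = np.zeros(p)
    e1[0] = 1.0
    v = e1 - mu
    n = np.linalg.norm(v)
    if n < 1e-12:                     # mu is already e1: nothing to do
        return samples
    v = v / n
    return samples - 2.0 * (samples @ v)[:, None] * v[None, :]

p = 4
mu = rng.normal(size=p)
mu /= np.linalg.norm(mu)

# placeholder t's (uniform, NOT f_radial) plus a uniform tangent vector
t = rng.uniform(-1, 1, size=1000)
v = rng.normal(size=(1000, p - 1))
v /= np.linalg.norm(v, axis=1, keepdims=True)
x_e1 = np.concatenate([t[:, None], np.sqrt(1 - t**2)[:, None] * v], axis=1)

x = householder_to_mu(x_e1, mu)
print(np.allclose(np.linalg.norm(x, axis=1), 1.0))  # still unit vectors
print(np.allclose(x @ mu, t))                       # cosine to mu is still t
```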
At this point only one problem remains: sampling $t$. Because of the complexity of $f_{radial}(t; \kappa, p)$, this is somewhat troublesome. We start from the special case $p=3$, where the factor $(1-t^2)^\frac{p-3}{2}$ vanishes and Inverse Transform Sampling yields an efficient sampler. When $p \gt 3$, however, Rejection Sampling is needed, and it is considerably more involved.
5.1 Inverse Transform Sampling for $p=3$
When $p=3$, the probability density function of $t$ is:

$$\begin{aligned} f_{radial}(t; \kappa, 3) &= \frac{(\kappa/2)^{\frac{1}{2}}}{\sqrt{\pi}\Gamma(1)I_{\frac{1}{2}}(\kappa)} (1-t^2)^{0} exp(\kappa t) \\ &= \frac{\kappa^{\frac{1}{2}}}{\sqrt{2\pi}I_{\frac{1}{2}}(\kappa)} exp(\kappa t) \end{aligned}$$

The $(1-t^2)^{\frac{p-3}{2}}$ factor disappears, so the cumulative distribution function can be obtained in closed form and Inverse Transform Sampling completes the sampling efficiently.
Inverse Transform Sampling: for any continuous distribution with cumulative distribution function $F$, the transformed sample $F(x)$ follows the $U(0,1)$ distribution.
Compute its cumulative distribution function:

$$\begin{aligned} F(t; \kappa, 3) &= \int_{-1}^{t}f_{radial}(x; \kappa, 3) dx \\ &= \frac{\kappa^{\frac{1}{2}}}{\sqrt{2\pi}I_{\frac{1}{2}}(\kappa)} \int_{-1}^{t}e^{\kappa x} dx \\ &= \frac{1}{\sqrt{2\pi\kappa}I_{\frac{1}{2}}(\kappa)} e^{\kappa x}\Big|_{-1}^{t} \\ &= \frac{e^{\kappa t} - e^{-\kappa}}{\sqrt{2\pi\kappa}I_{\frac{1}{2}}(\kappa)} \\ F(1; \kappa, 3) &= \frac{e^{\kappa} - e^{-\kappa}}{\sqrt{2\pi\kappa}I_{\frac{1}{2}}(\kappa)} = 1 \\ \Longrightarrow F(t; \kappa, 3) &= \frac{e^{\kappa t} - e^{-\kappa}}{e^{\kappa} - e^{-\kappa}} \end{aligned}$$

Following Inverse Transform Sampling, set $u = F(t;\kappa,3) \sim U(0,1)$ and solve for $t$:

$$t = \frac{1}{\kappa} ln(ue^{\kappa} + (1-u)e^{-\kappa})$$

This is equivalent to the formula on Wikipedia; Wikipedia's version avoids the factor $e^\kappa$, which may prevent overflow when $\kappa$ is large.
Plotting $t$ as a function of $u$:
As the plot shows, when $u \sim U(0,1)$, $t \in (-1,1)$, and the larger $\kappa$ is, the more $t$ leans toward $1$. For example, when $\kappa=10$, a randomly drawn $u$ puts $t$ almost surely in $[0.5, 1]$. When $\kappa=0$ (compute the limit via L'Hôpital's rule), $t=2u-1 \sim U(-1,1)$.
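The inverse transform above can be sketched in a few lines; the version below uses the algebraically equal form $t = 1 + \frac{1}{\kappa}ln\left(u + (1-u)e^{-2\kappa}\right)$, which never evaluates $e^{\kappa}$ and so avoids the overflow just mentioned:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_t_p3(kappa, size, rng):
    # t = (1/k) ln(u e^k + (1-u) e^{-k}), rewritten by factoring out e^k
    u = rng.uniform(0.0, 1.0, size=size)
    return 1.0 + np.log(u + (1.0 - u) * np.exp(-2.0 * kappa)) / kappa

t = sample_t_p3(10.0, 100_000, rng)
print(t.min(), t.max())          # always inside [-1, 1]
print(np.mean(t > 0.5))          # nearly all mass in [0.5, 1] for kappa = 10
# sanity check: E[t] should be A_3(10) = coth(10) - 1/10
print(t.mean(), 1 / np.tanh(10.0) - 0.1)
```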
5.2 Rejection Sampling for $p \gt 3$
Here the cumulative distribution function is intractable, so we have to resort to Rejection Sampling.
The basic idea of the accept-reject method is as follows. Suppose $p(x)$ cannot be sampled directly. Find a distribution that can be sampled directly, called the proposal distribution. Let $q(x)$ be the density of the proposal distribution, and suppose there is a constant $c>0$ such that $cq(x)$ is everywhere at least $p(x)$, as shown in the figure.
Sample according to $q(x)$; call the result $x^*$. Then accept $x^*$ at random with probability $\frac{p(x^*)}{cq(x^*)}$. Intuitively, points falling inside the region under $p(x^*)$ are accepted (green), and points outside it are rejected (red). The accept-reject method in effect samples according to the fraction of the area (or volume) covered by $p(x)$ within the area (or volume) covered by $cq(x)$.
Given two points $x_1, x_2$, what we actually need is that the ratio of the probability of drawing $x_1$ to that of drawing $x_2$ is $\frac{p(x_1)}{p(x_2)}$; let us check that the procedure above delivers this. The probability of proposing $x_1$ from $q(x)$ is $q(x_1)$, and its acceptance probability is $\frac{p(x_1)}{cq(x_1)}$, so overall the probability of drawing $x_1$ is $q(x_1)\frac{p(x_1)}{cq(x_1)} = \frac{p(x_1)}{c}$. Likewise, the probability of drawing $x_2$ is $\frac{p(x_2)}{c}$, and their ratio is $\frac{p(x_1)}{c}/\frac{p(x_2)}{c} = \frac{p(x_1)}{p(x_2)}$.
The mean acceptance rate is: $\int \frac{p(x)}{cq(x)} q(x)dx = \frac{1}{c}$. So, while keeping $cq(x) \ge p(x)$, $c$ should be made as small as possible in order to maximize the acceptance probability.
Next we present the accept-reject sampler from the paper Computer Generation of Distributions on the m-Sphere. Its chosen proposal distribution has probability density:

$$e(t; b) = \frac{2b^{\frac{p-1}{2}}}{\mathrm{B}(\frac{p-1}{2},\frac{p-1}{2})} \frac{(1-t^2)^{\frac{p-3}{2}}}{[(1+b)-(1-b)t]^{p-1}} ~~~~~ \text{where}~ t \in (-1,1),~ b \in (0,1)$$

The paper states: draw $y$ from the $Beta$ distribution $Be(\frac{p-1}{2},\frac{p-1}{2})$ and compute $t = \frac{1-(1+b)y}{1-(1-b)y}$; then $t$ has probability density $e(t; b)$. Since plenty of Python packages can sample the $Beta$ distribution, the "distribution that can be sampled directly" has been found.
In addition, to maximize the acceptance rate, the optimal value $b_0 = \frac{-2\kappa+\sqrt{4\kappa^2+(p-1)^2}}{p-1}$ is derived; after a fairly involved calculation, the paper arrives at the following sampling algorithm:
- Initialization: set $$\begin{aligned} b &= \frac{-2\kappa+\sqrt{4\kappa^2+(p-1)^2}}{p-1} \\ a &= \frac{(p-1) + 2\kappa + \sqrt{4\kappa^2+(p-1)^2}}{4} \\ d &= \frac{4ab}{(1+b)} - (p-1)log(p-1) \end{aligned}$$
- Generate $y \sim Be(\frac{p-1}{2},\frac{p-1}{2})$ and $u \sim U(0,1)$. Compute [the original paper may contain a misprint here] $$t = \frac{1-(1+b)y}{1-(1-b)y} ~~ \text{and} ~~ s=\frac{2ab}{1-(1-b)y}$$
- If $(p-1)log s-s+d < log u$, go to step 2; else accept $t$.
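The three steps above can be sketched in vectorized NumPy (an illustration of the algorithm as printed; the mean of the accepted $t$'s should match $A_p(\kappa) = I_{p/2}(\kappa)/I_{p/2-1}(\kappa)$):

```python
import numpy as np
from scipy.special import ive  # exponentially scaled Bessel function

rng = np.random.default_rng(0)

def sample_t(kappa, p, size, rng):
    # Initialization
    root = np.sqrt(4 * kappa**2 + (p - 1) ** 2)
    b = (-2 * kappa + root) / (p - 1)
    a = ((p - 1) + 2 * kappa + root) / 4
    d = 4 * a * b / (1 + b) - (p - 1) * np.log(p - 1)
    out = np.empty(size)
    filled = 0
    while filled < size:
        m = size - filled
        # Proposal: transformed Beta, plus the acceptance check
        y = rng.beta((p - 1) / 2, (p - 1) / 2, size=m)
        u = rng.uniform(size=m)
        t = (1 - (1 + b) * y) / (1 - (1 - b) * y)
        s = 2 * a * b / (1 - (1 - b) * y)
        accepted = t[(p - 1) * np.log(s) - s + d >= np.log(u)]
        out[filled:filled + accepted.size] = accepted
        filled += accepted.size
    return out

p, kappa = 6, 8.0
t = sample_t(kappa, p, 200_000, rng)
# E[t] = A_p(kappa); the scaling factors of ive cancel in the ratio
print(t.mean(), ive(p / 2, kappa) / ive(p / 2 - 1, kappa))
```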
Code experiments confirm that the algorithm is correct, including:
- drawing 1000000 samples, whose mean satisfies $mean = A_p(\kappa)\bm{\mu}$;
- computing the inner products of these 1000000 samples with $\bm{\mu}$ and averaging them, then doing the same with `scipy.stats.vonmises_fisher`: the average inner products agree, suggesting the two distributions are the same.
For the detailed derivation, which also covers the average acceptance rate of the rejection sampling, see 《vMF 分布的接受-拒绝采样》.
5.3 Fast Python Sampler of the von Mises Fisher Distribution
For the sampling of $t$, the paper Fast Python Sampler of the von Mises Fisher Distribution proposes another rejection-sampling method:
The underlying principle is the same: the $Beta$ distribution $Be(\frac{p-1}{2}, \frac{p-1}{2})$ is transformed and used as the proposal distribution for rejection sampling. The paper claims it is faster, but working through the derivation shows that it is exactly the same as the sampling algorithm above. See Appendix 5.3 for the detailed derivation.