# 1.Stein’s Identity

$$\mathbb{E}_{p}\left[\nabla_{x} \log q(x)\, f(x)+\nabla_{x} f(x)\right]=0 \quad \text{if} \quad p(x)=q(x)$$
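The identity can be checked numerically. The sketch below is a Monte Carlo check under assumptions that are not part of the text above: a one-dimensional $p=\mathcal{N}(0,1)$, a hypothetical test function $f(x)=\sin x+\cos x$, and a mismatched $q=\mathcal{N}(1,1)$ for contrast.

```python
import numpy as np

def stein_mean(samples, score_q, f, grad_f):
    """Monte Carlo estimate of E_p[ s_q(x) f(x) + f'(x) ] for scalar x."""
    return np.mean(score_q(samples) * f(samples) + grad_f(samples))

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)           # samples from p = N(0, 1)

f = lambda t: np.sin(t) + np.cos(t)        # hypothetical test function
grad_f = lambda t: np.cos(t) - np.sin(t)   # its derivative

# q = p: the score of N(0, 1) is -x, so the estimate should be close to 0.
matched = stein_mean(x, lambda t: -t, f, grad_f)
# q = N(1, 1): the score is -(x - 1), so the identity no longer holds.
mismatched = stein_mean(x, lambda t: -(t - 1.0), f, grad_f)
print(matched, mismatched)
```

In the mismatched case the expectation reduces to $\mathbb{E}_p[f(x)]=e^{-1/2}$, because $s_q(x)-s_p(x)=1$; a nonzero value is what signals $p \neq q$.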

# 2.Discrepancy Measure in the Vector Case

$$\mathbb{E}_{x \sim p}\left[\mathcal{A}_{q} f(x)\right]=0 \quad \text{where} \quad q=p$$
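The operator $\mathcal{A}_q$ is not written out here. A sketch of the usual definition, assumed because it makes this identity coincide with the one in Section 1, is:

$$s_{q}(x)=\nabla_{x} \log q(x), \qquad \mathcal{A}_{q} f(x)=s_{q}(x) f(x)+\nabla_{x} f(x)$$

Under this assumption, $\mathbb{E}_{x \sim p}\left[\mathcal{A}_{q} f(x)\right]=\mathbb{E}_{p}\left[\nabla_{x} \log q(x) f(x)+\nabla_{x} f(x)\right]$, which vanishes when $q=p$. For a scalar $f$ and $x \in \mathbb{R}^{d}$, $\mathcal{A}_{q} f(x)$ is a $d$-dimensional vector, which is why this is the vector case.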

# 3.Discrepancy Measure in the Matrix Case

Stein's identity can be written as:
$$\mathbb{E}_{x \sim p}\left[\mathcal{A}_{q} \phi(x)\right]=\mathbb{E}_{x \sim p}\left[\mathcal{A}_{q} \phi(x)-\mathcal{A}_{p} \phi(x)\right]=\mathbb{E}_{x \sim p}\left[(s_q(x)-s_p(x)) \phi^{\top}(x)\right]$$

$$\mathbb{E}_{p}\left[\operatorname{trace}\left(\mathcal{A}_{q}\phi(x)\right)\right]=\mathbb{E}_{p}\left[\left(\boldsymbol{s}_{q}(x)-\boldsymbol{s}_{p}(x)\right)^{\top} \phi(x)\right]$$
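The trace identity can also be checked by sampling. The sketch below assumes the matrix-valued operator $\mathcal{A}_{q}\phi(x)=s_{q}(x)\phi(x)^{\top}+\nabla_{x}\phi(x)$, so that its trace is $s_{q}(x)^{\top}\phi(x)+\nabla_{x}\cdot\phi(x)$. The distributions $p=\mathcal{N}(0,I)$ and $q=\mathcal{N}(m,I)$ and the test function $\phi(x)=\cos(x)$ (elementwise) are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 200_000
x = rng.standard_normal((n, d))        # samples from p = N(0, I)
m = np.array([1.0, -0.5])              # hypothetical mean of q = N(m, I)

score_p = -x                           # s_p(x) = -x
score_q = -(x - m)                     # s_q(x) = -(x - m)
phi = np.cos(x)                        # phi(x) = cos(x), applied elementwise
div_phi = -np.sin(x).sum(axis=1)       # trace of the Jacobian of phi

# Left side: E_p[ trace(A_q phi) ] = E_p[ s_q^T phi + div phi ]
lhs = np.mean(np.sum(score_q * phi, axis=1) + div_phi)
# Right side: E_p[ (s_q - s_p)^T phi ]
rhs = np.mean(np.sum((score_q - score_p) * phi, axis=1))
print(lhs, rhs)
```

Both estimates target $m^{\top}\mathbb{E}_p[\cos(x)]=0.5\,e^{-1/2}$ for this choice of $m$; they differ only by Monte Carlo noise in the term $\mathbb{E}_p[\operatorname{trace}(\mathcal{A}_p\phi)]$, which is zero in expectation.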

# 4.Introducing the Kernel

## 4.1 RKHS

$$\int K(\mathrm{x}, \mathrm{y}) \psi(\mathrm{x})\, d \mathrm{x}=\lambda \psi(\mathrm{y})$$

$$K(\mathrm{x}, \mathrm{y})=\sum_{i=1}^{\infty} \lambda_{i} \psi_{i}(\mathrm{x}) \psi_{i}(\mathrm{y})$$

$\left\{\sqrt{\lambda_{i}} \psi_{i}\right\}_{i=1}^{\infty}$ serves as a set of orthogonal basis functions that span a Hilbert space $\mathcal{H}$. Any function (vector) in this space can be written as a linear combination of this basis, for example $f=\left(f_{1}, f_{2}, \ldots\right)_{\mathcal{H}}^{T}$.

$K(\mathrm{x}, \cdot)$ denotes the kernel with one argument fixed at $x$: a function of a single variable, or equivalently the infinite-dimensional vector given by row $x$ of the kernel matrix. We then have
$$\begin{array}{c} K(\mathrm{x}, \cdot)=\left(\sqrt{\lambda_{1}} \psi_{1}(\mathrm{x}), \sqrt{\lambda_{2}} \psi_{2}(\mathrm{x}), \ldots\right)_{\mathcal{H}}^{T} \\ K(\mathrm{y}, \cdot)=\left(\sqrt{\lambda_{1}} \psi_{1}(\mathrm{y}), \sqrt{\lambda_{2}} \psi_{2}(\mathrm{y}), \ldots\right)_{\mathcal{H}}^{T} \\ \left\langle K(\mathbf{x}, \cdot), K(\mathbf{y}, \cdot)\right\rangle_{\mathcal{H}}=\sum_{i=1}^{\infty} \lambda_{i} \psi_{i}(\mathbf{x}) \psi_{i}(\mathbf{y})=K(\mathbf{x}, \mathbf{y}) \end{array}$$
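A finite-sample analogue makes the reproducing property concrete. In the sketch below, the Gram matrix on a grid of points stands in for the kernel, its eigenvectors stand in for $\psi_i$, and each row of `features` plays the role of $K(\mathrm{x}, \cdot)$ expressed in the basis $\sqrt{\lambda_i}\psi_i$. The RBF kernel, its bandwidth, and the grid are assumptions chosen for illustration.

```python
import numpy as np

def rbf_gram(x, bandwidth=1.0):
    """Gram matrix K[i, j] = exp(-(x_i - x_j)^2 / (2 h^2)) for 1-D points."""
    diff = x[:, None] - x[None, :]
    return np.exp(-diff**2 / (2.0 * bandwidth**2))

x = np.linspace(-2.0, 2.0, 30)             # hypothetical evaluation grid
K = rbf_gram(x)

# Discrete counterpart of the eigenfunction equation: K v = lambda v.
eigvals, eigvecs = np.linalg.eigh(K)
eigvals = np.clip(eigvals, 0.0, None)      # remove tiny negative round-off

# Row i holds (sqrt(lambda_1) psi_1(x_i), sqrt(lambda_2) psi_2(x_i), ...).
features = eigvecs * np.sqrt(eigvals)

# Inner products of these rows reproduce the kernel values.
K_rebuilt = features @ features.T
print(np.max(np.abs(K - K_rebuilt)))
```

The reconstruction error is at the level of floating-point round-off, mirroring $\langle K(\mathbf{x}, \cdot), K(\mathbf{y}, \cdot)\rangle_{\mathcal{H}}=K(\mathbf{x}, \mathbf{y})$.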

## 4.2 Kernelized Stein Discrepancy (KSD)

$$S(p, q)=\mathbb{E}_{\mathbf{x}, \mathbf{y} \sim p}\left[\left(s_{q}(\mathbf{x})-s_{p}(\mathbf{x})\right)^{T} k(\mathbf{x}, \mathbf{y})\left(s_{q}(\mathbf{y})-s_{p}(\mathbf{y})\right)\right]$$

### Proof of Feasibility

$$\begin{aligned} S(p, q) &=\mathbb{E}_{\mathbf{x}, \mathbf{y} \sim p}\left[\left(s_{q}(\mathbf{x})-s_{p}(\mathbf{x})\right)^{T} k(\mathbf{x}, \mathbf{y})\left(s_{q}(\mathbf{y})-s_{p}(\mathbf{y})\right)\right] \\ &=\mathbb{E}_{\mathbf{x}, \mathbf{y} \sim p}\left[\left(s_{q}(\mathbf{x})-s_{p}(\mathbf{x})\right)^{T}\left(k(\mathbf{x}, \mathbf{y}) s_{q}(\mathbf{y})+\nabla_{\mathbf{y}} k(\mathbf{x}, \mathbf{y})-k(\mathbf{x}, \mathbf{y}) s_{p}(\mathbf{y})-\nabla_{\mathbf{y}} k(\mathbf{x}, \mathbf{y})\right)\right] \\ &=\mathbb{E}_{\mathbf{x}, \mathbf{y} \sim p}\left[\left(s_{q}(\mathbf{x})-s_{p}(\mathbf{x})\right)^{T} v(\mathbf{x}, \mathbf{y})\right] \end{aligned}$$

Here $v(\mathbf{x}, \mathbf{y})=k(\mathbf{x}, \mathbf{y}) s_{q}(\mathbf{y})+\nabla_{\mathbf{y}} k(\mathbf{x}, \mathbf{y})$. The remaining term $k(\mathbf{x}, \mathbf{y}) s_{p}(\mathbf{y})+\nabla_{\mathbf{y}} k(\mathbf{x}, \mathbf{y})$ has zero expectation over $\mathbf{y} \sim p$ by Stein's identity, which gives the last line.

$$\begin{aligned} S(p, q) &=\mathbb{E}_{\mathbf{x}, \mathbf{y} \sim p}\left[s_{q}(\mathbf{x})^{T} v(\mathbf{x}, \mathbf{y})-\left(\nabla_{\mathbf{x}} \ln p(\mathbf{x})\right)^{T} v(\mathbf{x}, \mathbf{y})\right] \\ &=\mathbb{E}_{\mathbf{x}, \mathbf{y} \sim p}\left[s_{q}(\mathbf{x})^{T} v(\mathbf{x}, \mathbf{y})\right]-\int d \mathbf{x}\, d \mathbf{y}\, p(\mathbf{x}) p(\mathbf{y})\left(\nabla_{\mathbf{x}} \ln p(\mathbf{x})\right)^{T} v(\mathbf{x}, \mathbf{y}) \\ &=\mathbb{E}_{\mathbf{x}, \mathbf{y} \sim p}\left[s_{q}(\mathbf{x})^{T} v(\mathbf{x}, \mathbf{y})\right]+\mathbb{E}_{\mathbf{x}, \mathbf{y} \sim p}\left[\operatorname{tr} \nabla_{\mathbf{x}} v(\mathbf{x}, \mathbf{y})\right] \\ &=\mathbb{E}_{\mathbf{x}, \mathbf{y} \sim p}\left[u_{q}(\mathbf{x}, \mathbf{y})\right] \end{aligned}$$
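Because $u_q$ depends only on $s_q$ and the kernel, $S(p,q)$ can be estimated from samples of $p$ without knowing $s_p$. The sketch below assumes $v(\mathbf{x},\mathbf{y})=k(\mathbf{x},\mathbf{y})s_q(\mathbf{y})+\nabla_{\mathbf{y}}k(\mathbf{x},\mathbf{y})$, so that $u_q=s_q(\mathbf{x})^T k\, s_q(\mathbf{y})+s_q(\mathbf{x})^T\nabla_{\mathbf{y}}k+\nabla_{\mathbf{x}}k^T s_q(\mathbf{y})+\operatorname{tr}\nabla_{\mathbf{x}}\nabla_{\mathbf{y}}k$. The RBF kernel with fixed bandwidth, the U-statistic averaging over pairs $i \neq j$, and the function name `ksd_u_statistic` are illustrative choices, not something the derivation prescribes.

```python
import numpy as np

def ksd_u_statistic(x, score_q, bandwidth=1.0):
    """U-statistic estimate of S(p, q) from samples x ~ p with shape (n, d)."""
    n, d = x.shape
    s = score_q(x)                                   # s_q at each sample, (n, d)
    diff = x[:, None, :] - x[None, :, :]             # diff[i, j] = x_i - x_j
    sq_dist = np.sum(diff**2, axis=-1)
    h2 = bandwidth**2
    k = np.exp(-sq_dist / (2.0 * h2))                # RBF kernel k(x_i, x_j)

    term_ss = (s @ s.T) * k                                # s_q(x)^T k s_q(y)
    term_sy = np.einsum("id,ijd->ij", s, diff) / h2 * k    # s_q(x)^T grad_y k
    term_xs = -np.einsum("jd,ijd->ij", s, diff) / h2 * k   # grad_x k^T s_q(y)
    term_xy = (d / h2 - sq_dist / h2**2) * k               # tr(grad_x grad_y k)

    u = term_ss + term_sy + term_xs + term_xy
    np.fill_diagonal(u, 0.0)                         # keep only pairs with i != j
    return u.sum() / (n * (n - 1))

rng = np.random.default_rng(0)
x = rng.standard_normal((500, 2))                        # samples from p = N(0, I)
ksd_matched = ksd_u_statistic(x, lambda t: -t)           # q = p
ksd_shifted = ksd_u_statistic(x, lambda t: -(t - 1.0))   # q = N((1, 1), I)
print(ksd_matched, ksd_shifted)
```

The matched estimate fluctuates around zero and can be slightly negative, since a U-statistic is unbiased but not constrained to be positive. For the shifted $q$ in this setup the population value is $2/3$, so the estimate should be clearly positive.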

### Proof of Tractability

$$\boldsymbol{\beta}(\mathbf{y})=\mathbb{E}_{\mathbf{x} \sim p}\left[\mathcal{A}_{q} k_{\mathbf{y}}(\mathbf{x})\right]=\mathbb{E}_{\mathbf{x} \sim p}\left[\left(s_{q}(\mathbf{x})-s_{p}(\mathbf{x})\right) k_{\mathbf{y}}(\mathbf{x})\right]$$

$$\begin{aligned} S(p, q) &=\mathbb{E}_{\mathbf{x}, \mathbf{y} \sim p}\left[\left(s_{q}(\mathbf{x})-s_{p}(\mathbf{x})\right)^{T} k(\mathbf{x}, \mathbf{y})\left(s_{q}(\mathbf{y})-s_{p}(\mathbf{y})\right)\right] \\ &=\mathbb{E}_{\mathbf{x}, \mathbf{y} \sim p}\left[\left(s_{q}(\mathbf{x})-s_{p}(\mathbf{x})\right)^{T}\left\langle k(\mathbf{x}, \cdot), k(\cdot, \mathbf{y})\right\rangle_{\mathcal{H}}\left(s_{q}(\mathbf{y})-s_{p}(\mathbf{y})\right)\right] \\ &=\sum_{i=1}^{d}\left\langle\mathbb{E}_{\mathbf{x} \sim p}\left[\left(s_{q}^{i}(\mathbf{x})-s_{p}^{i}(\mathbf{x})\right) k(\mathbf{x}, \cdot)\right], \mathbb{E}_{\mathbf{y} \sim p}\left[\left(s_{q}^{i}(\mathbf{y})-s_{p}^{i}(\mathbf{y})\right) k(\cdot, \mathbf{y})\right]\right\rangle_{\mathcal{H}} \\ &=\sum_{i=1}^{d}\left\langle\beta_{i}, \beta_{i}\right\rangle_{\mathcal{H}}\\ &=\|\boldsymbol{\beta}\|^2_{\mathcal{H}^{d}} \end{aligned}$$

$$\|\boldsymbol{\beta}\|_{\mathcal{H}^{d}}=\sqrt{S(p, q)}=\max _{\boldsymbol{\phi} \in \mathcal{H}^{d}}\left\{\mathbb{E}_{\mathbf{x} \sim p}\left[\operatorname{tr}\left(\mathcal{A}_{q} \boldsymbol{\phi}(\mathbf{x})\right)\right], \quad \text { s.t. } \quad\|\boldsymbol{\phi}\|_{\mathcal{H}^{d}} \leq 1\right\}$$
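One way to see why the maximum equals $\|\boldsymbol{\beta}\|_{\mathcal{H}^{d}}$ is sketched below. It assumes $\boldsymbol{\beta}(\cdot)=\mathbb{E}_{\mathbf{x} \sim p}\left[\mathcal{A}_{q} k(\mathbf{x}, \cdot)\right]$ as defined above, the reproducing property $\phi_i(\mathbf{x})=\langle\phi_i, k(\mathbf{x}, \cdot)\rangle_{\mathcal{H}}$, and that differentiation can be exchanged with the inner product:

$$\begin{aligned} \mathbb{E}_{\mathbf{x} \sim p}\left[\operatorname{tr}\left(\mathcal{A}_{q} \boldsymbol{\phi}(\mathbf{x})\right)\right] &=\sum_{i=1}^{d} \mathbb{E}_{\mathbf{x} \sim p}\left[s_{q}^{i}(\mathbf{x}) \phi_{i}(\mathbf{x})+\partial_{x_{i}} \phi_{i}(\mathbf{x})\right] \\ &=\sum_{i=1}^{d}\left\langle\phi_{i}, \mathbb{E}_{\mathbf{x} \sim p}\left[s_{q}^{i}(\mathbf{x}) k(\mathbf{x}, \cdot)+\partial_{x_{i}} k(\mathbf{x}, \cdot)\right]\right\rangle_{\mathcal{H}} \\ &=\langle\boldsymbol{\phi}, \boldsymbol{\beta}\rangle_{\mathcal{H}^{d}} \leq\|\boldsymbol{\phi}\|_{\mathcal{H}^{d}}\|\boldsymbol{\beta}\|_{\mathcal{H}^{d}} \end{aligned}$$

The bound is the Cauchy-Schwarz inequality, with equality at $\boldsymbol{\phi}^{*}=\boldsymbol{\beta} /\|\boldsymbol{\beta}\|_{\mathcal{H}^{d}}$, so under the constraint $\|\boldsymbol{\phi}\|_{\mathcal{H}^{d}} \leq 1$ the maximum is $\|\boldsymbol{\beta}\|_{\mathcal{H}^{d}}$.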

# 5.SVGD Algorithm

## 5.1 Connection to KL Divergence

$$\begin{array}{c} \mathrm{KL}\left(q_{[T]} \| p\right)=\mathrm{KL}\left(q \| p_{\left[T^{-1}\right]}\right) \\ \nabla_{\epsilon} \mathrm{KL}\left(q_{[T]} \| p\right)=-\mathbb{E}_{x \sim q}\left[\nabla_{\epsilon} \log p_{\left[T^{-1}\right]}(x)\right] \end{array}$$

Expanding $\nabla_{\epsilon} \log p_{\left[T^{-1}\right]}(x)$ gives
$$\nabla_{\epsilon} \log p_{\left[T^{-1}\right]}(x)=s_{p}(\boldsymbol{T}(x))^{\top} \nabla_{\epsilon} \boldsymbol{T}(x)+\operatorname{trace}\left(\left(\nabla_{x} \boldsymbol{T}(x)\right)^{-1} \cdot \nabla_{\epsilon} \nabla_{x} \boldsymbol{T}(x)\right)$$

For the perturbation $\boldsymbol{T}(x)=x+\epsilon \boldsymbol{\phi}(x)$ evaluated at $\epsilon=0$:
$$\boldsymbol{T}(x)=x, \quad \nabla_{\epsilon} \boldsymbol{T}(x)=\boldsymbol{\phi}(x), \quad \nabla_{x} \boldsymbol{T}(x)=I, \quad \nabla_{\epsilon} \nabla_{x} \boldsymbol{T}(x)=\nabla_{x} \boldsymbol{\phi}(x).$$
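Substituting these values into the expansion gives $\nabla_{\epsilon} \log p_{[T^{-1}]}(x)=s_p(x)^{\top}\boldsymbol{\phi}(x)+\operatorname{trace}(\nabla_x\boldsymbol{\phi}(x))=\operatorname{trace}(\mathcal{A}_p\boldsymbol{\phi}(x))$, so the steepest-descent direction for the KL divergence is the $\boldsymbol{\phi}$ that maximizes $\mathbb{E}_{x\sim q}[\operatorname{trace}(\mathcal{A}_p\boldsymbol{\phi}(x))]$. The sketch below shows the resulting particle update as it is commonly written, $x_i \leftarrow x_i+\epsilon\,\frac{1}{n}\sum_j\left[k(x_j,x_i)s_p(x_j)+\nabla_{x_j}k(x_j,x_i)\right]$. This section does not write the update out, so the form used here, the RBF kernel with fixed bandwidth, the step size, and the function name `svgd_direction` are all assumptions.

```python
import numpy as np

def svgd_direction(x, score_p, bandwidth=1.0):
    """Empirical direction: mean_j [ k(x_j, x_i) s_p(x_j) + grad_{x_j} k(x_j, x_i) ]."""
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]                 # diff[j, i] = x_j - x_i
    h2 = bandwidth**2
    k = np.exp(-np.sum(diff**2, axis=-1) / (2.0 * h2))   # k[j, i] = k(x_j, x_i)
    # Kernel-weighted scores pull particles toward high-density regions of p.
    attract = k.T @ score_p(x)
    # grad_{x_j} k(x_j, x_i) = -(x_j - x_i) / h^2 * k pushes particles apart.
    repulse = -np.einsum("ji,jid->id", k, diff) / h2
    return (attract + repulse) / n

rng = np.random.default_rng(0)
particles = rng.standard_normal((100, 1))        # initial q = N(0, 1)
score_p = lambda t: -(t - 2.0)                   # hypothetical target p = N(2, 1)

step_size = 0.1                                  # fixed epsilon, chosen by hand
for _ in range(500):
    particles = particles + step_size * svgd_direction(particles, score_p)

print(particles.mean(), particles.std())
```

The first term moves the particles toward the mode of the target, and the second keeps them from collapsing onto it. A fixed bandwidth and step size keep the sketch short; practical implementations often adapt both.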

# 6.Experiments
