[NeurIPS 2018] Hyperbolic neural networks

连理o

Last modified 2023-05-04 12:38:53

Tags: NeurIPS 2018

First published 2023-02-03 14:19:38

Article link: https://blog.csdn.net/weixin_42437114/article/details/128780399

Introduction
The Geometry of the Poincaré Ball
Hyperbolic Neural Networks
Experiments
References

Introduction

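The authors argue that hyperbolic representations have not yet matched Euclidean ones in practice because there are no corresponding hyperbolic neural network layers, which makes it hard to use hyperbolic embeddings in downstream tasks. To close this gap, they combine Möbius gyrovector spaces with the Poincaré model and derive hyperbolic versions of several neural network components: multinomial logistic regression (MLR), feed-forward networks (FFNN), and recurrent neural networks (RNN) such as the GRU. This makes it possible to embed and classify data in hyperbolic space.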
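This work lets us embed and classify data in hyperbolic space more effectively, and it also shows how to combine Euclidean and hyperbolic models, which suggests better ways to use hyperbolic embeddings. Some questions about the experiments: the embedding dimensions used are quite small, whereas many current models use 512 or 1024 dimensions. This low-dimensional setting favors hyperbolic models, so it is unclear whether they keep their advantage in high dimensions. The results also seem to indicate that hyperbolic models help only when the data is strongly tree-like; otherwise they may well underperform Euclidean models. Finally, the paper notes that the “highly non-convex spectrum of hyperbolic neural networks sometimes results in convergence to poor local minima, suggesting that initialization is very important”. Does this mean that training hyperbolic models is relatively unstable?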

The Geometry of the Poincaré Ball

Hyperbolic space: the Poincaré ball

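The Poincaré ball is the Riemannian manifold $(\mathbb D^n,g^{\mathbb D})$, where $\mathbb D^n=\{x\in\R^n:\|x\|<1\}$ and $g^{\mathbb D}$ is the Riemannian metric
$g_x^{\mathbb D}=\lambda_x^2 g^E,\quad \lambda_x:=\frac{2}{1-\|x\|^2},$
where $g^E=I_n$ is the Euclidean metric tensor. The induced distance is
$d_{\mathbb D}(x,y)=\cosh^{-1}\left(1+2\frac{\|x-y\|^2}{(1-\|x\|^2)(1-\|y\|^2)}\right).$
The Poincaré ball model is also conformal (angle-preserving): since $g^{\mathbb D}$ is a pointwise rescaling of $g^E$, angles between tangent vectors agree with their Euclidean values.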
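The distance formula above can be checked numerically. The following is a minimal sketch in plain Python (standard library only); the function name `poincare_distance` and the sample points are illustrative assumptions, not taken from the paper.

```python
import math

def poincare_distance(x, y):
    """Distance between two points of the open unit ball (curvature -1)."""
    sq_norm = lambda u: sum(a * a for a in u)
    diff = [a - b for a, b in zip(x, y)]
    arg = 1.0 + 2.0 * sq_norm(diff) / ((1.0 - sq_norm(x)) * (1.0 - sq_norm(y)))
    return math.acosh(arg)

# Two pairs with the same Euclidean shape, one near the origin, one near the boundary.
print(poincare_distance([0.1, 0.0], [0.0, 0.1]))
print(poincare_distance([0.9, 0.0], [0.0, 0.9]))
```

The second pair is far more distant than the first, even though it is only a Euclidean rescaling of it: distances blow up towards the boundary, which is what gives the ball room to embed tree-like data.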

Gyrovector spaces

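In Euclidean geometry, vector spaces provide algebraic operations such as vector addition, subtraction, and scalar multiplication. In hyperbolic geometry, gyrovector spaces provide the analogous operations. They have been used in special relativity to add velocity vectors inside a Poincaré ball of radius $c$ (the celerity, i.e. the speed of light), which guarantees that the resulting speed never exceeds the speed of light. We define the gyrovector space $\mathbb D_c^n:=\{x\in\R^n\mid c\|x\|^2<1\}$ (i.e., the Poincaré ball model of constant negative curvature $-c$), where $c\geq0$. When $c=0$, $\mathbb D_c^n=\R^n$; when $c>0$, $\mathbb D_c^n$ is the open ball of radius $1/\sqrt c$; when $c=1$, it is the unit ball.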

Möbius addition

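Möbius addition / Hyperbolic translation. The Möbius addition of $x$ and $y$ in $\mathbb D_c^n$ is defined as
$x\oplus_c y:=\frac{\left(1+2c\langle x,y\rangle+c\|y\|^2\right)x+\left(1-c\|x\|^2\right)y}{1+2c\langle x,y\rangle+c^2\|x\|^2\|y\|^2}$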
When $c=0$, Möbius addition reduces to Euclidean vector addition. When $c>0$, it is neither commutative nor associative, but every $x\in\mathbb D_c^n$ has an identity and an inverse: $x\oplus_c\mathbf{0}=\mathbf{0}\oplus_c x=x$ and $(-x)\oplus_c x=x\oplus_c(-x)=\mathbf{0}$. It also satisfies the left-cancellation law $(-x)\oplus_c\left(x\oplus_c y\right)=y$. Below, the authors write $\oplus$ for $\oplus_1$.
Möbius subtraction. $x\ominus_c y:=x\oplus_c(-y)$
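The algebraic properties of Möbius addition listed above (Euclidean limit, non-commutativity, left cancellation) are easy to verify numerically. A minimal sketch in plain Python follows; the helper names `dot`, `mobius_add`, and `mobius_sub` and the test points are assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mobius_add(x, y, c=1.0):
    """Möbius addition of x and y in the ball of curvature -c."""
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    coef_x = 1.0 + 2.0 * c * xy + c * y2
    coef_y = 1.0 - c * x2
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]

def mobius_sub(x, y, c=1.0):
    """Möbius subtraction: add the negated second argument."""
    return mobius_add(x, [-b for b in y], c)

x, y = [0.3, 0.1], [-0.2, 0.5]
print(mobius_add(x, y))                                # x (+) y
print(mobius_add(y, x))                                # y (+) x: a different point
print(mobius_add([-a for a in x], mobius_add(x, y)))   # left cancellation: close to y
print(mobius_add(x, y, c=0.0))                         # c = 0: plain x + y
```

Although $x\oplus_c y$ and $y\oplus_c x$ differ, they have the same norm; the two results are related by a rotation (the gyration).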

Möbius scalar multiplication

Möbius scalar multiplication. For $c>0$, the Möbius scalar multiplication of $x\in\mathbb D^n_c\backslash\{\mathbf 0\}$ by $r\in\R$ is defined as
$r\otimes_c x:=\frac{1}{\sqrt c}\tanh\left(r\tanh^{-1}\left(\sqrt c\|x\|\right)\right)\frac{x}{\|x\|}$
Note that $r\otimes_c\mathbf{0}:=\mathbf{0}$. As $c\rightarrow0$, we recover Euclidean scalar multiplication: $\lim_{c\rightarrow0}r\otimes_c x=rx$. Möbius scalar multiplication satisfies the following properties: (1) $n$ additions. $n\otimes_c x=x\oplus_c\cdots\oplus_c x$ ($n$ terms); (2) scalar distributivity. $\left(r+r^{\prime}\right)\otimes_c x=r\otimes_c x\oplus_c r^{\prime}\otimes_c x$; (3) scalar associativity. $\left(rr^{\prime}\right)\otimes_c x=r\otimes_c\left(r^{\prime}\otimes_c x\right)$; (4) scaling property. $|r|\otimes_c x/\left\|r\otimes_c x\right\|=x/\|x\|$
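Properties (1) and (2) and the Euclidean limit can be checked with a few lines of plain Python. This is a sketch under the definitions above; the function names and sample values are illustrative assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def mobius_add(x, y, c=1.0):
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    coef_x = 1.0 + 2.0 * c * xy + c * y2
    coef_y = 1.0 - c * x2
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]

def mobius_scalar_mul(r, x, c=1.0):
    """Möbius scalar multiplication r (x) x for c > 0; the zero vector maps to itself."""
    n = norm(x)
    if n == 0.0:
        return [0.0] * len(x)
    sc = math.sqrt(c)
    scale = math.tanh(r * math.atanh(sc * n)) / (sc * n)
    return [scale * a for a in x]

x = [0.3, 0.1]
print(mobius_scalar_mul(2.0, x), mobius_add(x, x))   # "n additions" with n = 2
print(mobius_scalar_mul(2.5, x, c=1e-10))            # close to the Euclidean 2.5 * x
print(norm(mobius_scalar_mul(100.0, x)))             # stays inside the unit ball
```

The `tanh` saturates for large $r$, so the result approaches the boundary of the ball without ever leaving it.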

Distance

Distance. If one defines the generalized hyperbolic metric tensor $g^c$ as the metric conformal to the Euclidean one (i.e., $g^c_x=(\lambda_x^c)^2g^E$, where $g^E=I_n$ is the Euclidean metric tensor), with conformal factor $\lambda_x^c:=2/\left(1-c\|x\|^2\right)$, then the induced distance function on $(\mathbb D^n_c,g^c)$ is given by
$d_c(x,y)=\frac{2}{\sqrt c}\tanh^{-1}\left(\sqrt c\left\|-x\oplus_c y\right\|\right)$
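Note that as $c\rightarrow0$ we recover the Euclidean distance up to a factor of 2, $\lim_{c\rightarrow0}d_c(x,y)=2\|x-y\|$, and for $c=1$ we recover the Poincaré ball distance $d_1(x,y)=\cosh^{-1}\left(1+2\frac{\|x-y\|^2}{(1-\|x\|^2)(1-\|y\|^2)}\right)$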
In addition, the conformal factor lets us define the inner product and norm on the tangent space at a point $x$ of the gyrovector space: for $u,v\in T_{x}\mathbb D_c^n$,
$\langle{u}, {v}\rangle_{{x}}^c=(\lambda_x^c)^2\langle{u}, {v}\rangle\\ \|v\|_x^c=\lambda_x^c\|v\|$
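Both limits mentioned above (the factor-2 Euclidean limit and the $c=1$ Poincaré distance) can be verified numerically. A minimal sketch in plain Python; the names `dist_c`, `conformal_factor`, and `tangent_norm` are assumptions chosen for this example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def mobius_add(x, y, c):
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    coef_x = 1.0 + 2.0 * c * xy + c * y2
    coef_y = 1.0 - c * x2
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]

def conformal_factor(x, c):
    """lambda_x^c = 2 / (1 - c ||x||^2)."""
    return 2.0 / (1.0 - c * dot(x, x))

def dist_c(x, y, c):
    """Induced distance d_c(x, y) for c > 0."""
    diff = mobius_add([-a for a in x], y, c)
    sc = math.sqrt(c)
    return 2.0 / sc * math.atanh(sc * norm(diff))

def tangent_norm(x, v, c):
    """Norm of a tangent vector v at x: the conformal factor times the Euclidean norm."""
    return conformal_factor(x, c) * norm(v)

x, y = [0.3, 0.1], [-0.2, 0.5]
euclid = norm([a - b for a, b in zip(x, y)])
for c in (1.0, 0.1, 1e-8):
    print(c, dist_c(x, y, c), 2.0 * euclid)
```

As $c$ shrinks, the first printed distance approaches the second column, $2\|x-y\|$.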

Hyperbolic trigonometry

Hyperbolic trigonometry. Hyperbolic angles (gyroangles) and the hyperbolic law of sines can be defined in the generalized Poincaré ball $(\mathbb D_c^n, g^c)$. See Appendix B of the paper for details.

Connecting Gyrovector spaces and Riemannian geometry of the Poincaré ball

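Geodesics. The geodesic connecting $x,y\in\mathbb D_c^n$ is
$\gamma_{x\rightarrow y}(t):=x\oplus_c\left(-x\oplus_c y\right)\otimes_c t,\quad \gamma_{x\rightarrow y}(0)=x,\ \gamma_{x\rightarrow y}(1)=y$
As $c\rightarrow0$, this becomes the Euclidean straight line $x+t(y-x)$.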
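Lemma 1. For any $x\in\mathbb D_c^n$ and $v\in T_x\mathbb D_c^n$ s.t. $g^c_x(v, v) = 1$, the unit-speed geodesic starting from $x$ with direction $v$ is given by:
$\gamma_{x,v}(t)=x\oplus_c\left(\tanh\left(\sqrt c\frac{t}{2}\right)\frac{v}{\sqrt c\|v\|}\right)$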
One can sanity-check that $d_c(\gamma(0),\gamma(t))=t,\forall t\in[0,1]$ . Proof. c.f. Appendix B in Ganea, et al.
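Exponential and logarithmic maps. The exponential map takes a point $p\in\mathbb D_c^n$ together with a small perturbation $v\in T_p\mathbb D_c^n$ (which can be viewed as a velocity vector) and maps this tangent vector back onto the gyrovector space, such that $t\in[0,1]\mapsto\exp_p^c(tv)$ is the geodesic connecting $p$ and $\exp_p^c(v)$, i.e., a geodesic $\gamma$ starting from $\gamma(0):=x\in M$ with unit-norm direction $\dot\gamma(0):=v\in T_xM$ is given by $t\mapsto\exp_x(tv)$ (From a point $p$ follow a geodesic in direction $v$, at speed $t$.). In Euclidean space, the exponential map is $\exp_p(v)=p+v$. The logarithmic map is the inverse of the exponential map: it gives the velocity vector in the tangent space at $p\in\mathbb D_c^n$ that leads from $p$ to $r\in\mathbb D_c^n$. In Euclidean space, the logarithmic map is $\log_p(r)=r-p$ (Figure from Angulo, Jesus. “Structure tensor image filtering using Riemannian L1 and L∞ center-of-mass.” Image Analysis & Stereology 33.2 (2014): 95-105.)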
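For any point $x\in\mathbb D_c^n$, the exponential map $\exp^c_x : T_x\mathbb D_c^n\rightarrow \mathbb D_c^n$ and the logarithmic map $\log^c_x : \mathbb D_c^n\rightarrow T_x\mathbb D_c^n$ are given for $v\neq\mathbf 0$ and $y\neq x$ by:
$\exp_x^c(v)=x\oplus_c\left(\tanh\left(\sqrt c\frac{\lambda_x^c\|v\|}{2}\right)\frac{v}{\sqrt c\|v\|}\right),\quad \log_x^c(y)=\frac{2}{\sqrt c\lambda_x^c}\tanh^{-1}\left(\sqrt c\left\|-x\oplus_c y\right\|\right)\frac{-x\oplus_c y}{\left\|-x\oplus_c y\right\|}$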
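As $c\rightarrow0$, we recover the Euclidean exponential and logarithmic maps. When $x=\mathbf 0$, for any $v\in T_{\mathbf{0}}\mathbb{D}_c^n\backslash\{\mathbf{0}\}$ and $y\in\mathbb{D}_c^n\backslash\{\mathbf{0}\}$, we have
$\exp_{\mathbf 0}^c(v)=\tanh\left(\sqrt c\|v\|\right)\frac{v}{\sqrt c\|v\|},\quad \log_{\mathbf 0}^c(y)=\tanh^{-1}\left(\sqrt c\|y\|\right)\frac{y}{\sqrt c\|y\|}$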
Proof. c.f. Appendix B in Ganea, et al.
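The closed forms for the exponential and logarithmic maps can be sanity-checked by verifying that they invert each other. The sketch below uses plain Python; `exp_map`, `log_map`, and the sample points are illustrative names and values, and no numerical clipping near the boundary is applied, which practical implementations usually add.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def mobius_add(x, y, c):
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    coef_x = 1.0 + 2.0 * c * xy + c * y2
    coef_y = 1.0 - c * x2
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]

def conformal_factor(x, c):
    return 2.0 / (1.0 - c * dot(x, x))

def exp_map(x, v, c):
    """Exponential map at x: tangent vector v -> point of the ball."""
    n = norm(v)
    if n == 0.0:
        return list(x)
    sc = math.sqrt(c)
    step = math.tanh(sc * conformal_factor(x, c) * n / 2.0) / (sc * n)
    return mobius_add(x, [step * a for a in v], c)

def log_map(x, y, c):
    """Logarithmic map at x: point y of the ball -> tangent vector at x."""
    d = mobius_add([-a for a in x], y, c)
    n = norm(d)
    if n == 0.0:
        return [0.0] * len(x)
    sc = math.sqrt(c)
    k = 2.0 / (sc * conformal_factor(x, c)) * math.atanh(sc * n) / n
    return [k * a for a in d]

x, v, c = [0.3, 0.1], [0.2, -0.1], 1.0
y = exp_map(x, v, c)
print(y)                           # a point inside the ball
print(log_map(x, y, c))            # recovers v up to rounding
print(exp_map(x, v, 1e-10))        # small curvature: close to x + v
```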
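Möbius scalar multiplication using exponential and logarithmic maps. Since the tangent space is Euclidean and therefore convenient for computation, Möbius scalar multiplication can be re-derived with the exponential and logarithmic maps:
$r\otimes_c x=\exp_{\mathbf 0}^c\left(r\log_{\mathbf 0}^c(x)\right),\quad \forall r\in\R,\ x\in\mathbb D_c^n$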
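Applying the formulas above also relates the geodesic between two points to the exponential map:
$\gamma_{x\rightarrow y}(t)=x\oplus_c\left(-x\oplus_c y\right)\otimes_c t=\exp_x^c\left(t\log_x^c(y)\right)$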
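Parallel transport. Parallel transport $P^c_{x\rightarrow y}:T_x\mathbb D^n_c\rightarrow T_y\mathbb D^n_c$ defines a linear isometry between two tangent spaces. It moves a tangent vector at $x$ along the geodesic from $x$ to $y$, keeping it parallel, and returns the resulting tangent vector at $y$. Parallel transport therefore lets us connect two different tangent spaces. In the manifold $(\mathbb D^n_c, g^c)$, the parallel transport w.r.t. the Levi-Civita connection of a vector $v\in T_{\mathbf 0}\mathbb D^n_c$ to another tangent space $T_x\mathbb D^n_c$ is given by the following isometry:
$P^c_{\mathbf 0\rightarrow x}(v)=\log_x^c\left(x\oplus_c\exp_{\mathbf 0}^c(v)\right)=\frac{\lambda_{\mathbf 0}^c}{\lambda_x^c}v$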
This result matters when defining and optimizing parameters shared across different tangent spaces, e.g. biases in hyperbolic neural layers or parameters of hyperbolic MLR.
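As a sanity check, parallel transport from the origin can be computed in two ways: with the closed form $(\lambda_{\mathbf 0}^c/\lambda_x^c)v$ and through the log/exp composition $\log_x^c(x\oplus_c\exp_{\mathbf 0}^c(v))$. The plain-Python sketch below compares the two; all function names are assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def mobius_add(x, y, c):
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    coef_x = 1.0 + 2.0 * c * xy + c * y2
    coef_y = 1.0 - c * x2
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]

def conformal_factor(x, c):
    return 2.0 / (1.0 - c * dot(x, x))

def exp0(v, c):
    n = norm(v)
    if n == 0.0:
        return [0.0] * len(v)
    sc = math.sqrt(c)
    return [math.tanh(sc * n) * a / (sc * n) for a in v]

def log_map(x, y, c):
    d = mobius_add([-a for a in x], y, c)
    n = norm(d)
    if n == 0.0:
        return [0.0] * len(x)
    sc = math.sqrt(c)
    k = 2.0 / (sc * conformal_factor(x, c)) * math.atanh(sc * n) / n
    return [k * a for a in d]

def transport_from_origin(x, v, c):
    """Closed form: rescale v by lambda_0 / lambda_x, where lambda_0 = 2."""
    scale = 2.0 / conformal_factor(x, c)
    return [scale * a for a in v]

x, v, c = [0.3, 0.1], [0.4, -0.2], 1.0
closed_form = transport_from_origin(x, v, c)
via_maps = log_map(x, mobius_add(x, exp0(v, c), c), c)
print(closed_form, via_maps)
# Isometry: the Riemannian norm at x of the transported vector equals that of v at 0.
print(conformal_factor(x, c) * norm(closed_form), 2.0 * norm(v))
```

Because the transport from the origin is a plain rescaling, a parameter stored in $T_{\mathbf 0}\mathbb D_c^n=\R^n$ can be moved to any tangent space at negligible cost.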

For detailed derivations, see the original paper and another paper by the same authors:
Octavian-Eugen Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic entailment cones for learning hierarchical embeddings. In Proceedings of the thirty-fifth international conference on machine learning (ICML), 2018.

Hyperbolic Neural Networks

Möbius version

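Analogously to Möbius scalar multiplication, we can define the Möbius version of a map $f:\R^n\rightarrow\R^m$ in three steps: (1) project the point to the tangent space at the origin with the logarithmic map; (2) transform the tangent vector there with the Euclidean operator $f$; (3) map the result back to the gyrovector space with the exponential map:
$f^{\otimes_c}:\mathbb D_c^n\rightarrow\mathbb D_c^m,\quad f^{\otimes_c}(x):=\exp_{\mathbf 0}^c\left(f\left(\log_{\mathbf 0}^c(x)\right)\right)$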
When $f$ is continuous, $\lim_{c\rightarrow0}f^{\otimes_c}(x)=f(x)$. The definition satisfies the following properties: (1) morphism property. $(f\circ g)^{\otimes_c}=f^{\otimes_c}\circ g^{\otimes_c}$; (2) direction preserving. $f^{\otimes_c}(x)/\left\|f^{\otimes_c}(x)\right\|=f(x)/\|f(x)\|$ for $f(x)\neq\mathbf0$.
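If there are several maps $f_1,\dots,f_k$ (corresponding to the layers of a neural network), the morphism property gives the Möbius version of their composition:
$\left(f_k\circ\cdots\circ f_1\right)^{\otimes_c}=f_k^{\otimes_c}\circ\cdots\circ f_1^{\otimes_c}=\exp_{\mathbf 0}^c\circ f_k\circ\cdots\circ f_1\circ\log_{\mathbf 0}^c$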
If there are multiple inputs ($f:\mathbb{R}^n\times\mathbb{R}^p\rightarrow\mathbb{R}^m$), the Möbius version is
$f^{\otimes_c}:\left(h, h^{\prime}\right) \in \mathbb{D}_c^n \times \mathbb{D}_c^p \mapsto \exp _{\boldsymbol0}^c\left(f\left(\log _{\boldsymbol{0}}^c(h), \log _{\boldsymbol{0}}^c\left(h^{\prime}\right)\right)\right)$
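The log, apply, exp recipe translates directly into a higher-order function. The following plain-Python sketch wraps an arbitrary Euclidean map, with one or several inputs, into its Möbius version; `mobius_version` and the example maps are hypothetical names introduced for illustration.

```python
import math

def norm(u):
    return math.sqrt(sum(a * a for a in u))

def exp0(v, c):
    """Exponential map at the origin."""
    n = norm(v)
    if n == 0.0:
        return [0.0] * len(v)
    sc = math.sqrt(c)
    return [math.tanh(sc * n) * a / (sc * n) for a in v]

def log0(y, c):
    """Logarithmic map at the origin."""
    n = norm(y)
    if n == 0.0:
        return [0.0] * len(y)
    sc = math.sqrt(c)
    return [math.atanh(sc * n) * a / (sc * n) for a in y]

def mobius_version(f, c):
    """Wrap a Euclidean map f (one or more vector inputs) into its Möbius version."""
    def wrapped(*points):
        tangent = [log0(p, c) for p in points]   # (1) to the tangent space at 0
        return exp0(f(*tangent), c)              # (2) apply f, (3) back to the ball
    return wrapped

c = 1.0
double = mobius_version(lambda v: [2.0 * a for a in v], c)
squash = mobius_version(lambda v: [math.tanh(a) for a in v], c)
# Two-input example: the gating map f(h, h') = sigmoid(h) * h' (element-wise).
gate = mobius_version(
    lambda h, hp: [hp_i / (1.0 + math.exp(-h_i)) for h_i, hp_i in zip(h, hp)], c)

x = [0.3, 0.1]
print(double(x))                 # coincides with the Möbius scalar product 2 (x) x
print(squash(double(x)))         # composition of Möbius versions
print(gate(x, [0.2, -0.4]))      # a point of the ball again
```

The morphism property holds because $\log_{\mathbf 0}^c\circ\exp_{\mathbf 0}^c$ is the identity, so the intermediate projections cancel when Möbius versions are chained.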

Hyperbolic multiclass logistic regression (MLR) (softmax regression)

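We now derive the hyperbolic version of MLR for multiclass classification.
Start with MLR in Euclidean space. Given $K$ classes, we learn one hyperplane $H_{a,b}$ per class, where $H_{a,b}$ is described by a normal vector $a$ and a scalar bias $b$:
$H_{a,b}=\{x\in\R^n:\langle a,x\rangle-b=0\},\quad a\in\R^n\backslash\{\mathbf 0\},\ b\in\R$
The probability that a sample belongs to each class is given by the softmax:
$p(y=k\mid x)\propto\exp\left(\langle a_k,x\rangle-b_k\right),\quad k\in\{1,\dots,K\}$
Note that since $d(x,H_{a,b})=\frac{|\langle a,x\rangle-b|}{\|a\|}$, we have
$\langle a,x\rangle-b=\text{sign}\left(\langle a,x\rangle-b\right)\|a\|\,d(x,H_{a,b})$
Substituting this into the class probability gives
$p(y=k\mid x)\propto\exp\left(\text{sign}\left(\langle a_k,x\rangle-b_k\right)\|a_k\|\,d(x,H_{a_k,b_k})\right)$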
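The Euclidean hyperplane $H_{a,b}$ does not generalize directly to the Poincaré ball (because of the scalar bias $b$: hyperbolic space has no addition of a vector and a scalar), so the authors reparametrize it. Let $b=\langle a,p\rangle$ and $\tilde H_{a,p}=H_{a,\langle a,p\rangle}$; then
$\tilde H_{a,p}=\{x\in\R^n:\langle-p+x,a\rangle=0\}=p+\{a\}^\perp$
where $\{a\}^\perp$ is the set of all vectors orthogonal to $a$. The class probability can now be written as
$p(y=k\mid x)\propto\exp\left(\text{sign}\left(\langle-p_k+x,a_k\rangle\right)\|a_k\|\,d(x,\tilde H_{a_k,p_k})\right)$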
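With these definitions, MLR can be generalized to hyperbolic space. First, $\tilde H_{a,p}$ generalizes to Poincaré hyperplanes: for $p\in\mathbb D_c^n$ and $a\in T_p\mathbb D_c^n\backslash\{\mathbf 0\}$,
$\tilde H_{a,p}^c:=\{x\in\mathbb D_c^n:\langle\log_p^c(x),a\rangle_p=0\}=\{x\in\mathbb D_c^n:\langle-p\oplus_c x,a\rangle=0\}$
where $\langle\cdot,\cdot\rangle$ is the Euclidean inner product. Note that $a$ lives in the tangent space at $p$, whereas $x$ and $p$ are points on the manifold, so the Euclidean $\langle-p+x,a\rangle$ is generalized to the inner product $\langle\log_p^c(x),a\rangle_p$ on the tangent space at $p$. The proof of the last equality is in Appendix D of the paper. From this definition, the Poincaré hyperplane $\tilde H_{a,p}^c$ can also be described as the union of images of all geodesics in $\mathbb D^n_c$ orthogonal to $a$ and containing $p$, where $a$ is the normal vector and $p$ is a point on the hyperplane. The two figures below show a 2D and a 3D hyperplane in hyperbolic space.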
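With Poincaré hyperplanes defined, the distance from $x\in\mathbb D^n_c$ to $\tilde H^c_{a,p}$ is (see Appendix E of the paper for the proof)
$d_c(x,\tilde H_{a,p}^c)=\frac{1}{\sqrt c}\sinh^{-1}\left(\frac{2\sqrt c\,|\langle-p\oplus_c x,a\rangle|}{\left(1-c\|-p\oplus_c x\|^2\right)\|a\|}\right)$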
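Finally, plugging this distance into the class probability, replacing $+$ with $\oplus_c$ and $\|a_k\|$ with $\sqrt{g_{p_k}^c(a_k,a_k)}=\lambda_{p_k}^c\|a_k\|$, gives the final formula for MLR in the Poincaré ball:
$p(y=k\mid x)\propto\exp\left(\frac{\lambda_{p_k}^c\|a_k\|}{\sqrt c}\sinh^{-1}\left(\frac{2\sqrt c\,\langle-p_k\oplus_c x,a_k\rangle}{\left(1-c\|-p_k\oplus_c x\|^2\right)\|a_k\|}\right)\right)$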
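(Why is the sign function dropped in Eq. (23)? A likely reason: $\sinh^{-1}$ is odd, so $\text{sign}(t)\sinh^{-1}(k|t|)=\sinh^{-1}(kt)$ for $k>0$, and removing the absolute value absorbs the sign.) Moreover, as $c\rightarrow0$, we have
$p(y=k\mid x)\propto\exp\left(4\left\langle-p_k+x,a_k\right\rangle\right)=\exp\left(\left(\lambda_{p_k}^0\right)^2\left\langle-p_k+x,a_k\right\rangle\right)=\exp\left(\left\langle-p_k+x,a_k\right\rangle_0\right)$
which recovers the Euclidean softmax.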
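For optimization, $p_k$ lies on the manifold and can be trained with Riemannian stochastic gradient descent. $a_k$, however, lives in the tangent space at $p_k$, which is hard to optimize directly. Using parallel transport, we instead express it through a vector $a_k'\in T_{\mathbf 0}\mathbb D_c^n=\R^n$ in the tangent space at the origin (a Euclidean space), $a_k=P^c_{\mathbf 0\rightarrow p_k}(a_k')=(\lambda_{\mathbf 0}^c/\lambda_{p_k}^c)a_k'$, and optimize $a_k'$ as a Euclidean parameter.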
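The final MLR formula can be sketched in a few lines. The plain-Python example below computes one logit per class and normalizes with a softmax. It takes $a_k$ directly as input rather than the transported parameter $a_k'$, and the names `hyperbolic_mlr_logit` and `hyperbolic_mlr_probs` as well as the toy parameters are assumptions, not the authors' implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def mobius_add(x, y, c):
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    coef_x = 1.0 + 2.0 * c * xy + c * y2
    coef_y = 1.0 - c * x2
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]

def hyperbolic_mlr_logit(x, p, a, c):
    """Signed, scaled distance of x to the Poincaré hyperplane through p with normal a."""
    z = mobius_add([-t for t in p], x, c)          # -p (+) x
    lam_p = 2.0 / (1.0 - c * dot(p, p))            # conformal factor at p
    sc, na = math.sqrt(c), norm(a)
    arg = 2.0 * sc * dot(z, a) / ((1.0 - c * dot(z, z)) * na)
    return lam_p * na / sc * math.asinh(arg)

def hyperbolic_mlr_probs(x, ps, normals, c):
    logits = [hyperbolic_mlr_logit(x, p, a, c) for p, a in zip(ps, normals)]
    m = max(logits)                                # subtract the max for stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy parameters for K = 3 classes in the 2D unit ball (c = 1).
ps = [[0.1, 0.2], [-0.3, 0.0], [0.0, -0.4]]
normals = [[1.0, 0.0], [-0.5, 0.5], [0.2, -1.0]]
x = [0.4, 0.1]
print(hyperbolic_mlr_probs(x, ps, normals, 1.0))
print(hyperbolic_mlr_logit(x, ps[0], normals[0], 1e-8))   # close to the Euclidean limit
print(4.0 * dot([xi - pi for xi, pi in zip(x, ps[0])], normals[0]))
```

A point on the hyperplane ($x=p_k$) gets logit zero, and the sign of the logit tells on which side of the hyperplane $x$ lies.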

Hyperbolic feed-forward layers

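Möbius matrix-vector multiplication. Based on the definition of the Möbius version, we can define Möbius versions of further operations. For a linear map $M:\R^n\rightarrow\R^m$ (identified with its matrix) and $x\in\mathbb D_c^n$ with $Mx\neq\mathbf 0$,
$M\otimes_c x:=M^{\otimes_c}(x)=\frac{1}{\sqrt c}\tanh\left(\frac{\|Mx\|}{\|x\|}\tanh^{-1}\left(\sqrt c\|x\|\right)\right)\frac{Mx}{\|Mx\|}$
and $M\otimes_c x=\mathbf 0$ if $Mx=\mathbf 0$.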
Pointwise non-linearity. If $\varphi:\R^n\rightarrow \R^n$ is a pointwise non-linearity, then its Möbius version $\varphi^{\otimes_c}$ can be applied to elements of the Poincaré ball.
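Bias translation. Möbius translation of a point $x\in\mathbb D^n_c$ by a bias $b\in\mathbb D^n_c$ is given by
$x\leftarrow x\oplus_c b=\exp_x^c\left(P^c_{\mathbf 0\rightarrow x}\left(\log_{\mathbf 0}^c(b)\right)\right)=\exp_x^c\left(\frac{\lambda_{\mathbf 0}^c}{\lambda_x^c}\log_{\mathbf 0}^c(b)\right)$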
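Concatenation of multiple input vectors. Let $x_1\in\mathbb D_c^n$, $x_2\in\mathbb D_c^p$, and let $x\in\mathbb D_c^n\times\mathbb D_c^p$ be the concatenation of $x_1,x_2$. Let $M_1\in\mathcal M_{m,n}(\mathbb R)$ and $M_2\in\mathcal M_{m,p}(\mathbb R)$ be the matrices of two linear maps, and let $M\in\mathcal M_{m,n+p}(\mathbb R)$ be the horizontal concatenation of $M_1$ and $M_2$. In the Euclidean case $Mx=M_1x_1+M_2x_2$; by analogy, one defines
$M\otimes_c x:=M_1\otimes_c x_1\oplus_c M_2\otimes_c x_2$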
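Combining the three ingredients above (Möbius matrix-vector multiplication, bias translation, pointwise non-linearity) gives one hyperbolic feed-forward layer. The plain-Python sketch below is a minimal illustration under the stated definitions; `hyperbolic_layer`, the weight values, and the choice of `tanh` as non-linearity are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def mobius_add(x, y, c):
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    coef_x = 1.0 + 2.0 * c * xy + c * y2
    coef_y = 1.0 - c * x2
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]

def exp0(v, c):
    n = norm(v)
    if n == 0.0:
        return [0.0] * len(v)
    sc = math.sqrt(c)
    return [math.tanh(sc * n) * a / (sc * n) for a in v]

def log0(y, c):
    n = norm(y)
    if n == 0.0:
        return [0.0] * len(y)
    sc = math.sqrt(c)
    return [math.atanh(sc * n) * a / (sc * n) for a in y]

def mobius_matvec(M, x, c):
    """Möbius matrix-vector product; M is a list of rows."""
    Mx = [dot(row, x) for row in M]
    n_x, n_Mx = norm(x), norm(Mx)
    if n_Mx == 0.0:                      # covers x = 0 as well
        return [0.0] * len(Mx)
    sc = math.sqrt(c)
    scale = math.tanh(n_Mx / n_x * math.atanh(sc * n_x)) / (sc * n_Mx)
    return [scale * a for a in Mx]

def hyperbolic_layer(M, b, x, c, phi=math.tanh):
    """Möbius version of phi applied to (M (x) x) (+) b, with bias b inside the ball."""
    pre = mobius_add(mobius_matvec(M, x, c), b, c)
    return exp0([phi(t) for t in log0(pre, c)], c)

M = [[0.5, -1.0], [2.0, 0.3], [0.0, 1.5]]     # maps the 2D ball to the 3D ball
b = [0.1, 0.0, -0.1]
x = [0.3, 0.1]
print(mobius_matvec(M, x, 1.0))
print(hyperbolic_layer(M, b, x, 1.0))
```

Layers of this kind can be stacked, since every output is again a point of a Poincaré ball.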

Hyperbolic RNN

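Naive RNN.
$h_{t+1}=\varphi^{\otimes_c}\left(W\otimes_c h_t\oplus_c U\otimes_c x_t\oplus_c b\right),\quad h_t\in\mathbb D_c^n,\ x_t\in\mathbb D_c^d$
where $\varphi$ is tanh / sigmoid / ReLU and $W\in\mathcal{M}_{m,n}(\mathbb{R}), U\in\mathcal{M}_{m,d}(\mathbb{R}), b\in\mathbb{D}_c^m$. If $x_t$ is a Euclidean vector, it first has to be mapped with the exponential map, $\tilde x_t:=\exp^c_{\mathbf0}(x_t)$, before being plugged into the formula above. The base point $x$ is usually set to $\mathbf0$ which makes formulas less cumbersome and empirically has little impact on the obtained results.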
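GRU architecture. The Euclidean GRU, with reset gate $r_t$ and update gate $z_t$, is
$\begin{aligned} r_t&=\sigma\left(W^rh_{t-1}+U^rx_t+b^r\right),\quad z_t=\sigma\left(W^zh_{t-1}+U^zx_t+b^z\right) \\ \tilde h_t&=\varphi\left(W\left(r_t\odot h_{t-1}\right)+Ux_t+b\right),\quad h_t=(1-z_t)\odot h_{t-1}+z_t\odot\tilde h_t \end{aligned}$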
First write the hyperbolic version of the gating operation $f(h,h'):=\sigma(h)\odot h'$: $f^{\otimes_c}\left(h, h^{\prime}\right)=\exp _{\boldsymbol{0}}^c\left(\sigma\left(\log _{\boldsymbol{0}}^c(h)\right) \odot \log _{\boldsymbol{0}}^c\left(h^{\prime}\right)\right)=\exp _{\boldsymbol{0}}^c\left(\text{diag}(\sigma(\log^c_{\mathbf 0}(h)))\cdot \log _{\boldsymbol{0}}^c\left(h^{\prime}\right)\right)=\text{diag}(\sigma(\log^c_{\mathbf 0}(h)))\otimes_c h'$. The reset gate $r_t$ and update gate $z_t$ can therefore be written as
$r_t=\sigma \log _{\mathbf 0}^c\left(W^r \otimes_c h_{t-1} \oplus_c U^r \otimes_c x_t \oplus_c b^r\right)\\ z_t=\sigma \log _{\mathbf 0}^c\left(W^z \otimes_c h_{t-1} \oplus_c U^z \otimes_c x_t \oplus_c b^z\right)$
The hidden state update can be written as
$\begin{aligned} \tilde{h}_t&=\varphi^{\otimes_c}\left(W\otimes_c( \operatorname{diag}\left(r_t\right) \otimes_c h_{t-1}) \oplus_c U \otimes_c x_t \oplus_c b\right) \\&=\varphi^{\otimes_c}\left(\left(W \operatorname{diag}\left(r_t\right)\right) \otimes_c h_{t-1} \oplus_c U \otimes_c x_t \oplus_c b\right) \\h_t&=h_{t-1} \oplus_c \operatorname{diag}\left(z_t\right) \otimes_c\left(-h_{t-1} \oplus_c \tilde{h}_t\right) \end{aligned}$
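The GRU equations above can be assembled into a single cell. The plain-Python sketch below is an illustration, not the authors' implementation: the parameter dictionary, the helper names, the left-to-right evaluation of chained $\oplus_c$, and $\varphi=\tanh$ are assumptions, and no clipping near the boundary of the ball is applied.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def mobius_add(x, y, c):
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    coef_x, coef_y = 1.0 + 2.0 * c * xy + c * y2, 1.0 - c * x2
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]

def exp0(v, c):
    n, sc = norm(v), math.sqrt(c)
    return [0.0] * len(v) if n == 0.0 else [math.tanh(sc * n) * a / (sc * n) for a in v]

def log0(y, c):
    n, sc = norm(y), math.sqrt(c)
    return [0.0] * len(y) if n == 0.0 else [math.atanh(sc * n) * a / (sc * n) for a in y]

def mobius_matvec(M, x, c):
    Mx = [dot(row, x) for row in M]
    n_x, n_Mx = norm(x), norm(Mx)
    if n_Mx == 0.0:
        return [0.0] * len(Mx)
    sc = math.sqrt(c)
    scale = math.tanh(n_Mx / n_x * math.atanh(sc * n_x)) / (sc * n_Mx)
    return [scale * a for a in Mx]

def affine(W, U, b, h, x, c):
    """(W (x) h) (+) (U (x) x) (+) b, evaluated left to right."""
    return mobius_add(mobius_add(mobius_matvec(W, h, c), mobius_matvec(U, x, c), c), b, c)

def gate(W, U, b, h, x, c):
    """Gate values in (0, 1): sigmoid of the log-mapped pre-activation."""
    return [1.0 / (1.0 + math.exp(-t)) for t in log0(affine(W, U, b, h, x, c), c)]

def gru_update(h, h_tilde, z, c):
    """h (+) diag(z) (x) (-h (+) h_tilde)."""
    delta = mobius_add([-a for a in h], h_tilde, c)
    diag_z = [[z[i] if i == j else 0.0 for j in range(len(z))] for i in range(len(z))]
    return mobius_add(h, mobius_matvec(diag_z, delta, c), c)

def hyp_gru_cell(params, h, x, c):
    r = gate(params["Wr"], params["Ur"], params["br"], h, x, c)
    z = gate(params["Wz"], params["Uz"], params["bz"], h, x, c)
    W_r = [[w * r_j for w, r_j in zip(row, r)] for row in params["W"]]   # W diag(r)
    pre = affine(W_r, params["U"], params["b"], h, x, c)
    h_tilde = exp0([math.tanh(t) for t in log0(pre, c)], c)
    return gru_update(h, h_tilde, z, c)

c = 1.0
params = {
    "Wr": [[0.5, -0.2], [0.1, 0.3]], "Ur": [[0.4, 0.0], [0.0, 0.4]], "br": [0.0, 0.0],
    "Wz": [[0.2, 0.1], [-0.3, 0.6]], "Uz": [[0.1, 0.5], [0.5, 0.1]], "bz": [0.0, 0.0],
    "W": [[0.7, 0.2], [-0.4, 0.9]], "U": [[1.0, -0.5], [0.3, 0.8]], "b": [0.05, -0.05],
}
h = [0.0, 0.0]
for x_euclidean in ([0.2, -0.1], [0.4, 0.3], [-0.3, 0.5]):
    h = hyp_gru_cell(params, h, exp0(x_euclidean, c), c)   # map inputs into the ball first
print(h, norm(h))
```

The update rule interpolates along the geodesic-like path from $h_{t-1}$ to $\tilde h_t$: with $z_t=\mathbf 1$ left cancellation returns exactly $\tilde h_t$, and with $z_t=\mathbf 0$ the state is unchanged.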

Experiments

SNLI task and dataset. SNLI is a natural language inference / textual entailment dataset (decide whether a given premise entails a given hypothesis). It contains 570K training, 10K validation and 10K test sentence pairs.
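PREFIX task and datasets. PREFIX is a synthetic dataset built by the authors to test hyperbolic models on tree-structured data. The task is detection of noisy prefixes, i.e. given a pair of sentences, decide whether the second sentence is a noisy prefix of the first or a random sentence. PREFIX-Z% (for Z being 10, 30 or 50) means that, for a random prefix of the first sentence, a positive second sentence is generated by replacing Z% of the words in the prefix, while a negative one is a randomly generated sentence of the same length.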
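Models architecture. Hyperbolic layers can be stacked into $n$-layer networks just like Euclidean ones, and they can also be combined with Euclidean models, but training then has to use Riemannian optimization. The authors encode the two sentences with two distinct RNN or GRU models. The resulting embeddings, together with the squared distance between them (hyperbolic or Euclidean, depending on their geometry), are fed into an FFNN (Euclidean or hyperbolic), and an MLR (Euclidean or hyperbolic) performs the final classification. The loss is the cross-entropy (CE) loss.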
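Results. Euclidean models outperform hyperbolic ones on SNLI, which the authors attribute to the lack of hyperbolic counterparts of optimizers such as Adam. Hyperbolic models outperform Euclidean ones on tree-structured data. On the PREFIX datasets, the data becomes less tree-like as Z grows, and the performance gap between Euclidean and hyperbolic models shrinks accordingly.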
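MLR classification experiments. On SNLI, hyperbolic MLR shows no clear advantage over Euclidean MLR. The authors suggest that with end-to-end training the learned embeddings may already be well separable by Euclidean MLR. To demonstrate the benefit of hyperbolic MLR, they run an additional experiment: pick a subtree of WordNet and decide whether a node belongs to it. The models use word embeddings pretrained on WordNet and then perform binary classification with hyperbolic MLR, Euclidean MLR applied directly on the hyperbolic embeddings, or Euclidean MLR applied after mapping all embeddings in the tangent space at $\mathbf0$ using the $\log_{\mathbf 0}$ map.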
The figure below shows 2-dimensional embeddings and the trained separation hyperplanes.