[NeurIPS 2018] Hyperbolic neural networks

Introduction

  • The authors argue that hyperbolic representations have not yet matched the impact of Euclidean ones mainly because there are no corresponding hyperbolic neural network layers, which makes it hard to use hyperbolic embeddings in downstream tasks. To address this, they combine Möbius gyrovector spaces with the Poincaré model of hyperbolic geometry and derive hyperbolic versions of several core network components: multinomial logistic regression (MLR), feed-forward networks (FFNN), and recurrent networks such as the GRU, which makes it possible to embed and classify data directly in hyperbolic space.
  • This work makes it easier to embed and classify data in hyperbolic space and shows how Euclidean and hyperbolic components can be combined, which should encourage wider use of hyperbolic embeddings. Some questions about the experiments: the embedding dimensions used are quite small, while typical models today use 512, 1024 or more dimensions; this low-dimensional setting favors hyperbolic models, so it is unclear whether the advantage persists at larger dimensions. The results also suggest that hyperbolic models only help when the data is strongly tree-like, and may otherwise underperform Euclidean models. Finally, the authors note that the “highly non-convex spectrum of hyperbolic neural networks sometimes results in convergence to poor local minima, suggesting that initialization is very important”, which may indicate that training hyperbolic models is relatively unstable.

The Geometry of the Poincaré Ball

Hyperbolic space: the Poincaré ball

  • The Poincaré ball is $(\mathbb D^n, g^{\mathbb D})$, where $\mathbb D^n=\{x\in\mathbb R^n:\|x\|<1\}$ and $g^{\mathbb D}$ is the Riemannian metric
    $$g^{\mathbb D}_x=\lambda_x^2\,g^E,\qquad \lambda_x:=\frac{2}{1-\|x\|^2},$$
    where $g^E=I_n$ is the Euclidean metric tensor. The induced distance is
    $$d_{\mathbb D}(x,y)=\cosh^{-1}\!\left(1+2\,\frac{\|x-y\|^2}{(1-\|x\|^2)(1-\|y\|^2)}\right).$$
    The Poincaré ball model is moreover conformal to Euclidean space, i.e. it preserves angles.

Gyrovector spaces

  • In Euclidean geometry, vector spaces provide algebraic operations such as vector addition, subtraction and scalar multiplication; in hyperbolic geometry, gyrovector spaces provide analogous operations. These operations are used in special relativity: velocity vectors can be added inside a Poincaré ball of radius $c$ (the celerity, i.e. the speed of light) so that the resulting speed never exceeds the speed of light. We define the gyrovector space $\mathbb D_c^n:=\{x\in\mathbb R^n : c\|x\|^2<1\}$ (i.e., the Poincaré ball model of constant negative curvature $-c$), where $c\geq 0$. When $c=0$, $\mathbb D_c^n=\mathbb R^n$; when $c>0$, $\mathbb D_c^n$ is the open ball of radius $1/\sqrt c$; and when $c=1$, $\mathbb D_c^n$ is the unit ball.

Möbius addition

  • Möbius addition / hyperbolic translation. The Möbius addition of $x$ and $y$ in $\mathbb D_c^n$ is defined as
    $$x\oplus_c y:=\frac{(1+2c\langle x,y\rangle+c\|y\|^2)\,x+(1-c\|x\|^2)\,y}{1+2c\langle x,y\rangle+c^2\|x\|^2\|y\|^2}.$$
    When $c=0$, Möbius addition reduces to ordinary vector addition in Euclidean space. When $c>0$, it is neither commutative nor associative, but for any $x\in\mathbb D_c^n$ it admits an identity and an inverse: $x\oplus_c\mathbf 0=\mathbf 0\oplus_c x=x$ and $(-x)\oplus_c x=x\oplus_c(-x)=\mathbf 0$; it also satisfies the left cancellation law $(-x)\oplus_c(x\oplus_c y)=y$. In the following, $\oplus$ denotes $\oplus_1$.
  • Möbius subtraction is defined as $x\ominus_c y:=x\oplus_c(-y)$. (A small NumPy sketch of these two operations follows below.)
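To make the algebra concrete, here is a minimal NumPy sketch of Möbius addition and subtraction (the helper names `mobius_add` / `mobius_sub` are mine, not from the paper or its official code); it also checks the identity-element and left-cancellation properties stated above.

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    """x ⊕_c y in the Poincaré ball of curvature -c (assumes c·||·||^2 < 1)."""
    xy = np.dot(x, y)
    x2, y2 = np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den

def mobius_sub(x, y, c=1.0):
    """x ⊖_c y := x ⊕_c (-y)."""
    return mobius_add(x, -y, c)

# Quick checks of the stated identities:
x, y, c = np.array([0.3, -0.2]), np.array([0.1, 0.4]), 1.0
assert np.allclose(mobius_add(x, np.zeros(2), c), x)           # identity element
assert np.allclose(mobius_add(-x, mobius_add(x, y, c), c), y)  # left cancellation
```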

Möbius scalar multiplication

  • Möbius scalar multiplication. For $c>0$, the Möbius scalar multiplication of $x\in\mathbb D_c^n\setminus\{\mathbf 0\}$ by $r\in\mathbb R$ is defined as
    $$r\otimes_c x:=\frac{1}{\sqrt c}\tanh\!\left(r\tanh^{-1}(\sqrt c\,\|x\|)\right)\frac{x}{\|x\|},$$
    with $r\otimes_c\mathbf 0:=\mathbf 0$. As $c\to 0$ this recovers Euclidean scalar multiplication: $\lim_{c\to 0}r\otimes_c x=rx$. Möbius scalar multiplication satisfies: (1) $n$ additions, $n\otimes_c x=x\oplus_c\cdots\oplus_c x$; (2) scalar distributivity, $(r+r')\otimes_c x=r\otimes_c x\oplus_c r'\otimes_c x$; (3) scalar associativity, $(rr')\otimes_c x=r\otimes_c(r'\otimes_c x)$; (4) scaling property, $|r|\otimes_c x/\|r\otimes_c x\|=x/\|x\|$. (A sketch follows below.)
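A matching sketch of Möbius scalar multiplication (again, the helper name is mine), with a numeric check of the "n additions" property using `mobius_add` from the previous sketch:

```python
import numpy as np

def mobius_scalar_mul(r, x, c=1.0):
    """r ⊗_c x = (1/√c)·tanh(r·artanh(√c·||x||))·x/||x||, with r ⊗_c 0 := 0."""
    norm = np.linalg.norm(x)
    if norm == 0:
        return x
    sc = np.sqrt(c)
    return np.tanh(r * np.arctanh(sc * norm)) * x / (sc * norm)

# n additions: 3 ⊗_c x = x ⊕_c x ⊕_c x
x, c = np.array([0.3, -0.2]), 1.0
three_adds = mobius_add(mobius_add(x, x, c), x, c)
assert np.allclose(mobius_scalar_mul(3.0, x, c), three_adds)
```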

Distance

  • Distance. If one defines the generalized hyperbolic metric tensor $g^c$ as the metric conformal to the Euclidean one (i.e., $g^c_x=(\lambda_x^c)^2 g^E$, with $g^E=I_n$ the Euclidean metric tensor), with conformal factor
    $$\lambda_x^c:=\frac{2}{1-c\|x\|^2},$$
    then the induced distance function on $(\mathbb D_c^n, g^c)$ is given by
    $$d_c(x,y)=\frac{2}{\sqrt c}\tanh^{-1}\!\left(\sqrt c\,\|(-x)\oplus_c y\|\right).$$
    Note that as $c\to 0$ we recover (twice) the Euclidean distance, $\lim_{c\to 0}d_c(x,y)=2\|x-y\|$, and that $c=1$ recovers the Poincaré-ball distance above. (A sketch follows below.)
  • The conformal factor also defines the inner product and norm on the tangent space of the gyrovector space at $x$: for $u,v\in T_x\mathbb D_c^n$,
    $$\langle u,v\rangle_x^c=(\lambda_x^c)^2\langle u,v\rangle,\qquad \|v\|_x^c=\lambda_x^c\|v\|.$$
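Sketch of the conformal factor and the induced distance, continuing the helpers above (`lambda_x` and `dist_c` are my own names; `mobius_add` from the earlier sketch is assumed in scope); the last line illustrates the $c\to 0$ limit numerically.

```python
import numpy as np

def lambda_x(x, c=1.0):
    """Conformal factor λ_x^c = 2 / (1 - c·||x||^2)."""
    return 2.0 / (1.0 - c * np.dot(x, x))

def dist_c(x, y, c=1.0):
    """d_c(x, y) = (2/√c) · artanh(√c · ||(-x) ⊕_c y||)."""
    sc = np.sqrt(c)
    return 2.0 / sc * np.arctanh(sc * np.linalg.norm(mobius_add(-x, y, c)))

x, y = np.array([0.3, -0.2]), np.array([0.1, 0.4])
print(dist_c(x, y, c=1e-6), 2 * np.linalg.norm(x - y))  # nearly equal as c → 0
```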

Hyperbolic trigonometry

  • Hyperbolic trigonometry. Hyperbolic angles (gyroangles) and the hyperbolic law of sines in the generalized Poincaré ball $(\mathbb D_c^n, g^c)$ are discussed in Appendix B of the paper.

Connecting Gyrovector spaces and Riemannian geometry of the Poincaré ball

  • Geodesics. The geodesic connecting $x,y\in\mathbb D_c^n$ can be written with gyrovector operations as
    $$\gamma_{x\to y}(t)=x\oplus_c\big(((-x)\oplus_c y)\otimes_c t\big),\qquad \gamma_{x\to y}(0)=x,\ \ \gamma_{x\to y}(1)=y.$$
    As $c\to 0$ this recovers straight lines in Euclidean geometry, $\gamma_{x\to y}(t)=x+t(y-x)$.
  • Lemma 1. For any $x\in\mathbb D_c^n$ and $v\in T_x\mathbb D_c^n$ s.t. $g_x^c(v,v)=1$, the unit-speed geodesic starting from $x$ with direction $v$ is given by
    $$\gamma(t)=x\oplus_c\left(\tanh\!\left(\frac{\sqrt c\,t}{2}\right)\frac{v}{\sqrt c\,\|v\|}\right).$$
    One can sanity-check that $d_c(\gamma(0),\gamma(t))=t,\ \forall t\in[0,1]$. Proof: cf. Appendix B in Ganea et al.
  • Exponential and logarithmic maps. The exponential map takes a point $p\in\mathbb D_c^n$ and a small perturbation $v\in T_p\mathbb D_c^n$ (which can be seen as a velocity vector) and maps it from the tangent space back onto the gyrovector space, such that $t\in[0,1]\mapsto\exp_p^c(tv)$ is the geodesic connecting $p$ and $\exp_p^c(v)$, i.e., a geodesic $\gamma$ starting from $\gamma(0):=x\in M$ with unit-norm direction $\dot\gamma(0):=v\in T_xM$ is written $t\mapsto\exp_x(tv)$ (from a point $p$, follow the geodesic in direction $v$ at speed $t$). In Euclidean space the exponential map is simply $\exp_p(v)=p+v$. The logarithmic map is the inverse of the exponential map: it returns the tangent (velocity) vector at $p\in\mathbb D_c^n$ that points towards $r\in\mathbb D_c^n$. In Euclidean space, $\log_p(r)=r-p$. (Illustration in Angulo, Jesus. "Structure tensor image filtering using Riemannian L1 and L∞ center-of-mass." Image Analysis & Stereology 33.2 (2014): 95-105.)
    For any point $x\in\mathbb D_c^n$, the exponential map $\exp_x^c:T_x\mathbb D_c^n\to\mathbb D_c^n$ and the logarithmic map $\log_x^c:\mathbb D_c^n\to T_x\mathbb D_c^n$ are given for $v\neq\mathbf 0$ and $y\neq x$ by
    $$\exp_x^c(v)=x\oplus_c\left(\tanh\!\left(\sqrt c\,\frac{\lambda_x^c\|v\|}{2}\right)\frac{v}{\sqrt c\,\|v\|}\right),\qquad
    \log_x^c(y)=\frac{2}{\sqrt c\,\lambda_x^c}\tanh^{-1}\!\left(\sqrt c\,\|(-x)\oplus_c y\|\right)\frac{(-x)\oplus_c y}{\|(-x)\oplus_c y\|}.$$
    As $c\to 0$ these reduce to the Euclidean exponential and logarithmic maps. At $x=\mathbf 0$, for any $v\in T_{\mathbf 0}\mathbb D_c^n\setminus\{\mathbf 0\}$ and $y\in\mathbb D_c^n\setminus\{\mathbf 0\}$,
    $$\exp_{\mathbf 0}^c(v)=\tanh(\sqrt c\,\|v\|)\,\frac{v}{\sqrt c\,\|v\|},\qquad
    \log_{\mathbf 0}^c(y)=\tanh^{-1}(\sqrt c\,\|y\|)\,\frac{y}{\sqrt c\,\|y\|}.$$
    Proof: cf. Appendix B in Ganea et al. (A NumPy sketch follows below.)
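A sketch of the four maps (helper names are mine; `mobius_add` and `lambda_x` from the earlier sketches are assumed in scope), with a sanity check that exp and log are mutually inverse:

```python
import numpy as np

def expmap(x, v, c=1.0):
    """exp_x^c(v) = x ⊕_c ( tanh(√c·λ_x^c·||v||/2) · v / (√c·||v||) )."""
    sc, nv = np.sqrt(c), np.linalg.norm(v)
    if nv == 0:
        return x
    second = np.tanh(sc * lambda_x(x, c) * nv / 2) * v / (sc * nv)
    return mobius_add(x, second, c)

def logmap(x, y, c=1.0):
    """log_x^c(y) = (2/(√c·λ_x^c)) · artanh(√c·||u||) · u/||u||, u = (-x) ⊕_c y."""
    sc = np.sqrt(c)
    u = mobius_add(-x, y, c)
    nu = np.linalg.norm(u)
    return 2.0 / (sc * lambda_x(x, c)) * np.arctanh(sc * nu) * u / nu

def expmap0(v, c=1.0):
    """exp_0^c(v) = tanh(√c·||v||) · v / (√c·||v||)."""
    sc, nv = np.sqrt(c), np.linalg.norm(v)
    return np.tanh(sc * nv) * v / (sc * nv) if nv > 0 else v

def logmap0(y, c=1.0):
    """log_0^c(y) = artanh(√c·||y||) · y / (√c·||y||)."""
    sc, ny = np.sqrt(c), np.linalg.norm(y)
    return np.arctanh(sc * ny) * y / (sc * ny) if ny > 0 else y

# Sanity check: the two maps are inverse to each other.
x, y, c = np.array([0.3, -0.2]), np.array([0.1, 0.4]), 1.0
assert np.allclose(expmap(x, logmap(x, y, c), c), y)
```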
  • Möbius scalar multiplication using exponential and logarithmic maps. Since the tangent space is Euclidean and convenient for computation, Möbius scalar multiplication can be re-derived via the exponential and logarithmic maps:
    $$r\otimes_c x=\exp_{\mathbf 0}^c\big(r\log_{\mathbf 0}^c(x)\big),\qquad \forall r\in\mathbb R,\ x\in\mathbb D_c^n.$$
    The same pattern relates the geodesic between two points to the exponential map:
    $$\gamma_{x\to y}(t)=\exp_x^c\big(t\log_x^c(y)\big).$$
    (Both identities are checked numerically in the sketch below.)
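Both identities can be verified numerically with the helpers sketched so far (`mobius_scalar_mul`, `expmap0`, `logmap0`, `expmap`, `logmap`, `mobius_add` assumed in scope):

```python
import numpy as np

x, y = np.array([0.3, -0.2]), np.array([0.1, 0.4])
c, r, t = 1.0, 1.7, 0.4

# r ⊗_c x = exp_0^c( r · log_0^c(x) )
assert np.allclose(mobius_scalar_mul(r, x, c), expmap0(r * logmap0(x, c), c))

# γ_{x→y}(t) = exp_x^c( t · log_x^c(y) ) = x ⊕_c ( ((-x) ⊕_c y) ⊗_c t )
lhs = expmap(x, t * logmap(x, y, c), c)
rhs = mobius_add(x, mobius_scalar_mul(t, mobius_add(-x, y, c), c), c)
assert np.allclose(lhs, rhs)
```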
  • Parallel transport. Parallel transport $P_{x\to y}^c:T_x\mathbb D_c^n\to T_y\mathbb D_c^n$ is a linear isometry between two tangent spaces; it corresponds to moving a tangent vector at $x$ along the geodesic from $x$ to $y$ into the tangent space at $y$, and thus connects different tangent spaces. In the manifold $(\mathbb D_c^n, g^c)$, the parallel transport w.r.t. the Levi-Civita connection of a vector $v\in T_{\mathbf 0}\mathbb D_c^n$ to another tangent space $T_x\mathbb D_c^n$ is given by the following isometry:
    $$P_{\mathbf 0\to x}^c(v)=\log_x^c\big(x\oplus_c\exp_{\mathbf 0}^c(v)\big)=\frac{\lambda_{\mathbf 0}^c}{\lambda_x^c}\,v.$$
    This result is important for defining and optimizing parameters shared across different tangent spaces, such as biases in hyperbolic neural layers or the parameters of hyperbolic MLR. (A sketch follows below.)
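A sketch of parallel transport from the origin (helper name is mine; `lambda_x`, `logmap`, `expmap0`, `mobius_add` from the earlier sketches are assumed in scope), with a check of the closed form against the exp/log expression:

```python
import numpy as np

def parallel_transport0(x, v, c=1.0):
    """P_{0→x}^c(v) = (λ_0^c / λ_x^c) · v, with λ_0^c = 2."""
    return (2.0 / lambda_x(x, c)) * v

x, v, c = np.array([0.3, -0.2]), np.array([0.5, 0.1]), 1.0
assert np.allclose(parallel_transport0(x, v, c),
                   logmap(x, mobius_add(x, expmap0(v, c), c), c))
```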

Detailed derivations can be found in the original paper and in the authors' earlier work:
Octavian-Eugen Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic entailment cones for learning hierarchical embeddings. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.

Hyperbolic Neural Networks

Möbius version

  • Analogously to Möbius scalar multiplication, we can define the Möbius version of a map $f:\mathbb R^n\to\mathbb R^m$: (1) project the input to the tangent space at the origin via the logarithmic map; (2) apply the Euclidean map $f$ in the tangent space; (3) project the result back to the gyrovector space via the exponential map:
    $$f^{\otimes_c}(x):=\exp_{\mathbf 0}^c\big(f(\log_{\mathbf 0}^c(x))\big).$$
    If $f$ is continuous, then $\lim_{c\to 0}f^{\otimes_c}(x)=f(x)$. This definition satisfies: (1) morphism property, $(f\circ g)^{\otimes_c}=f^{\otimes_c}\circ g^{\otimes_c}$; (2) direction preservation, $f^{\otimes_c}(x)/\|f^{\otimes_c}(x)\|=f(x)/\|f(x)\|$ for $f(x)\neq\mathbf 0$.
  • If there are several maps (corresponding to several network layers), the morphism property gives the Möbius version of their composition directly: $(f_k\circ\cdots\circ f_1)^{\otimes_c}=f_k^{\otimes_c}\circ\cdots\circ f_1^{\otimes_c}$. If there are several inputs ($f:\mathbb R^n\times\mathbb R^p\to\mathbb R^m$), the Möbius version is
    $$f^{\otimes_c}:(h,h')\in\mathbb D_c^n\times\mathbb D_c^p\mapsto\exp_{\mathbf 0}^c\big(f(\log_{\mathbf 0}^c(h),\log_{\mathbf 0}^c(h'))\big).$$
    (A sketch of $f^{\otimes_c}$ follows below.)
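A sketch of the Möbius version of an arbitrary Euclidean map (helper name is mine; `expmap0` / `logmap0` from the earlier sketch are assumed in scope), illustrating the $c\to 0$ limit:

```python
import numpy as np

def mobius_fn_apply(f, x, c=1.0):
    """f^{⊗_c}(x) = exp_0^c( f( log_0^c(x) ) )."""
    return expmap0(f(logmap0(x, c)), c)

# For a continuous f, f^{⊗_c}(x) → f(x) as c → 0:
x = np.array([0.3, -0.2])
print(mobius_fn_apply(np.tanh, x, c=1e-8), np.tanh(x))  # nearly equal
```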

Hyperbolic multiclass logistic regression (MLR) (softmax regression)

  • We now derive a hyperbolic version of MLR for multi-class classification.
  • First recall MLR in Euclidean space. Given $K$ classes, one learns for each class a hyperplane $H_{a,b}$ used for classification, parametrized by a normal vector $a\in\mathbb R^n\setminus\{\mathbf 0\}$ and a scalar bias $b\in\mathbb R$:
    $$H_{a,b}:=\{x\in\mathbb R^n:\langle a,x\rangle-b=0\}.$$
    The probability of each class is given by the softmax
    $$p(y=k\mid x)\propto\exp(\langle a_k,x\rangle-b_k),\qquad a_k\in\mathbb R^n,\ b_k\in\mathbb R.$$
    Note that since $d(x,H_{a,b})=\frac{|\langle a,x\rangle-b|}{\|a\|}$, we have
    $$\langle a,x\rangle-b=\operatorname{sign}(\langle a,x\rangle-b)\,\|a\|\,d(x,H_{a,b}),$$
    and substituting into the class probabilities gives
    $$p(y=k\mid x)\propto\exp\big(\operatorname{sign}(\langle a_k,x\rangle-b_k)\,\|a_k\|\,d(x,H_{a_k,b_k})\big).$$
  • The Euclidean hyperplane $H_{a,b}$ cannot be carried over to the Poincaré ball directly (because of the scalar bias $b$: there is no addition between vectors and scalars in hyperbolic space), so the authors first reparametrize it. Writing $b=\langle a,p\rangle$ for some $p\in\mathbb R^n$ and setting $\tilde H_{a,p}=H_{a,\langle a,p\rangle}$, we get
    $$\tilde H_{a,p}=\{x\in\mathbb R^n:\langle -p+x,a\rangle=0\}=p+\{a\}^{\perp},$$
    where $\{a\}^{\perp}$ is the set of all vectors orthogonal to $a$. The class probabilities can now be written as
    $$p(y=k\mid x)\propto\exp\big(\operatorname{sign}(\langle -p_k+x,a_k\rangle)\,\|a_k\|\,d(x,\tilde H_{a_k,p_k})\big).$$
    (A small numeric check of this reformulation is sketched below.)
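A small numeric check of the reformulation (names are mine): the signed-distance form of the logit equals the usual $\langle a,x\rangle-b$ with $b=\langle a,p\rangle$.

```python
import numpy as np

def euclid_mlr_logit(x, a, p):
    """sign(⟨-p + x, a⟩) · ||a|| · d(x, H̃_{a,p})."""
    z = np.dot(a, x - p)
    dist = np.abs(z) / np.linalg.norm(a)
    return np.sign(z) * np.linalg.norm(a) * dist

x, a, p = np.array([0.3, -0.2]), np.array([1.0, 2.0]), np.array([0.1, 0.1])
print(euclid_mlr_logit(x, a, p), np.dot(a, x) - np.dot(a, p))  # equal
```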
  • With these definitions, MLR can be generalized to hyperbolic space. First, $\tilde H_{a,p}$ generalizes to the Poincaré hyperplane
    $$\tilde H_{a,p}^c:=\{x\in\mathbb D_c^n:\langle\log_p^c(x),a\rangle_p=0\}=\{x\in\mathbb D_c^n:\langle(-p)\oplus_c x,a\rangle=0\}=\exp_p^c(\{a\}^{\perp}),$$
    where $\langle\cdot,\cdot\rangle$ is the Euclidean inner product. Note that $a$ lives in the tangent space at $p$ while $x$ and $p$ live on the manifold, so the Euclidean $\langle -p+x,a\rangle$ is generalized to the inner product $\langle\log_p^c(x),a\rangle_p$ in the tangent space at $p$. The proof of the last equality is in Appendix D of the paper. From the definition one can also describe $\tilde H_{a,p}^c$ as the union of images of all geodesics in $\mathbb D_c^n$ orthogonal to $a$ and containing $p$: $a$ is the normal vector and $p$ is a point on the hyperplane. (The paper illustrates examples of 2D and 3D Poincaré hyperplanes.)
  • With the Poincaré hyperplane defined, the distance from $x\in\mathbb D_c^n$ to $\tilde H_{a,p}^c$ can be computed (proof in Appendix E of the paper):
    $$d_c(x,\tilde H_{a,p}^c)=\frac{1}{\sqrt c}\sinh^{-1}\!\left(\frac{2\sqrt c\,|\langle(-p)\oplus_c x,a\rangle|}{\big(1-c\|(-p)\oplus_c x\|^2\big)\|a\|}\right).$$
  • Finally, substituting this distance into the class-probability formula, replacing $+$ by $\oplus_c$ and the Euclidean norm $\|a_k\|$ by the tangent-space norm $\|a_k\|_{p_k}^c=\sqrt{g_{p_k}^c(a_k,a_k)}=\lambda_{p_k}^c\|a_k\|$, gives the final formula for MLR in the Poincaré ball: for $p_k\in\mathbb D_c^n$ and $a_k\in T_{p_k}\mathbb D_c^n\setminus\{\mathbf 0\}$,
    $$p(y=k\mid x)\propto\exp\!\left(\frac{\lambda_{p_k}^c\|a_k\|}{\sqrt c}\sinh^{-1}\!\left(\frac{2\sqrt c\,\langle(-p_k)\oplus_c x,a_k\rangle}{\big(1-c\|(-p_k)\oplus_c x\|^2\big)\|a_k\|}\right)\right).$$
    (The sign function from the Euclidean formula no longer appears explicitly because $\sinh^{-1}$ is odd: dropping the absolute value inside the $\sinh^{-1}$ is equivalent to multiplying its value by the sign of the inner product.) When $c\to 0$ we recover the Euclidean softmax:
    $$p(y=k\mid x)\propto\exp\big(4\langle -p_k+x,a_k\rangle\big)=\exp\big((\lambda_{p_k}^0)^2\langle -p_k+x,a_k\rangle\big)=\exp\big(\langle -p_k+x,a_k\rangle_{\mathbf 0}\big).$$
  • For optimization, $p_k$ lives on the manifold and can be updated with Riemannian SGD, while $a_k$ lives in the tangent space at $p_k$ and is hard to optimize directly; using parallel transport, it is represented by a vector $a_k'\in T_{\mathbf 0}\mathbb D_c^n=\mathbb R^n$ in the tangent space at the origin,
    $$a_k=P_{\mathbf 0\to p_k}^c(a_k')=\frac{\lambda_{\mathbf 0}^c}{\lambda_{p_k}^c}\,a_k',$$
    and $a_k'$ is then optimized as an ordinary Euclidean parameter. (A sketch of the resulting logit computation follows below.)
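A sketch of the resulting per-class score (names are mine; `mobius_add` and `lambda_x` from the earlier sketches are assumed in scope). The class parameter is stored as a Euclidean vector $a_k'\in T_{\mathbf 0}\mathbb D_c^n$ and parallel-transported to $T_{p_k}$ inside the function; class probabilities are then a softmax over the $K$ logits.

```python
import numpy as np

def hyperbolic_mlr_logit(x, a_prime, p, c=1.0):
    """Unnormalized log-probability of one class, a = P_{0→p}^c(a') = (2/λ_p^c)·a'."""
    lam_p = lambda_x(p, c)
    a = (2.0 / lam_p) * a_prime                # parallel transport of a' to T_p
    u = mobius_add(-p, x, c)                   # (-p) ⊕_c x
    sc = np.sqrt(c)
    arg = 2 * sc * np.dot(u, a) / ((1 - c * np.dot(u, u)) * np.linalg.norm(a))
    return lam_p * np.linalg.norm(a) / sc * np.arcsinh(arg)

# Two-class toy example: softmax over the per-class logits.
x, c = np.array([0.3, -0.2]), 1.0
classes = [(np.array([1.0, 2.0]), np.array([0.1, 0.1])),
           (np.array([-0.5, 1.0]), np.array([-0.2, 0.3]))]
logits = np.array([hyperbolic_mlr_logit(x, a_, p_, c) for a_, p_ in classes])
probs = np.exp(logits - logits.max()); probs /= probs.sum()
print(probs)
```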

Hyperbolic feed-forward layers

  • Möbius matrix–vector multiplication. Based on the definition of the Möbius version, further operations can be given a Möbius version. For $M\in\mathcal M_{m,n}(\mathbb R)$ and $x\in\mathbb D_c^n$ with $Mx\neq\mathbf 0$,
    $$M^{\otimes_c}(x):=\exp_{\mathbf 0}^c\big(M\log_{\mathbf 0}^c(x)\big)=\frac{1}{\sqrt c}\tanh\!\left(\frac{\|Mx\|}{\|x\|}\tanh^{-1}(\sqrt c\,\|x\|)\right)\frac{Mx}{\|Mx\|},$$
    and $M^{\otimes_c}(x):=\mathbf 0$ if $Mx=\mathbf 0$.
  • Pointwise non-linearity. If $\varphi:\mathbb R^n\to\mathbb R^n$ is a pointwise non-linearity, then its Möbius version $\varphi^{\otimes_c}$ can be applied to elements of the Poincaré ball.
  • Bias translation. The Möbius translation of a point $x\in\mathbb D_c^n$ by a bias $b\in\mathbb D_c^n$ is given by
    $$x\oplus_c b=\exp_x^c\big(P_{\mathbf 0\to x}^c(\log_{\mathbf 0}^c(b))\big)=\exp_x^c\!\left(\frac{\lambda_{\mathbf 0}^c}{\lambda_x^c}\log_{\mathbf 0}^c(b)\right).$$
  • Concatenation of multiple input vectors. Given $x_1\in\mathbb D_c^n$, $x_2\in\mathbb D_c^p$, let $x\in\mathbb D_c^n\times\mathbb D_c^p$ be the concatenation of $x_1,x_2$, let $M_1\in\mathcal M_{m,n}(\mathbb R)$, $M_2\in\mathcal M_{m,p}(\mathbb R)$ be two linear maps and $M\in\mathcal M_{m,n+p}(\mathbb R)$ the horizontal concatenation of $M_1$ and $M_2$. Then the Möbius version is defined as
    $$M^{\otimes_c}(x):=M_1^{\otimes_c}(x_1)\oplus_c M_2^{\otimes_c}(x_2).$$
    (A sketch of a complete hyperbolic feed-forward layer follows below.)
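A sketch of one complete hyperbolic feed-forward layer built from these pieces (names are mine; `mobius_add`, `expmap0`, `logmap0`, `mobius_fn_apply` from the earlier sketches are assumed in scope): Möbius matrix–vector product, bias translation by $\oplus_c$, then a pointwise non-linearity in its Möbius version.

```python
import numpy as np

def mobius_matvec(M, x, c=1.0):
    """M^{⊗_c}(x) = exp_0^c( M · log_0^c(x) )."""
    return expmap0(M @ logmap0(x, c), c)

def hyperbolic_linear(x, M, b, c=1.0, phi=np.tanh):
    """One hyperbolic FFNN layer: φ^{⊗_c}( M^{⊗_c}(x) ⊕_c b )."""
    h = mobius_add(mobius_matvec(M, x, c), b, c)
    return mobius_fn_apply(phi, h, c)

# Usage: a 2 → 3 layer with a hyperbolic bias b ∈ D_c^3.
rng = np.random.default_rng(0)
x = np.array([0.3, -0.2])
M = 0.1 * rng.normal(size=(3, 2))
b = expmap0(0.1 * rng.normal(size=3))
print(hyperbolic_linear(x, M, b, c=1.0))
```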

Hyperbolic RNN

  • Naive RNN. The hyperbolic version of the vanilla RNN cell $h_{t+1}=\varphi(Wh_t+Ux_t+b)$ is
    $$h_{t+1}=\varphi^{\otimes_c}\big(W\otimes_c h_t\oplus_c U\otimes_c x_t\oplus_c b\big),$$
    where $\varphi$ is tanh / sigmoid / ReLU, $W\in\mathcal M_{m,n}(\mathbb R)$, $U\in\mathcal M_{m,d}(\mathbb R)$, $b\in\mathbb D_c^m$. If the inputs $x_t$ are Euclidean vectors, they are first mapped into the ball via $\tilde x_t:=\exp_{\mathbf 0}^c(x_t)$ before being plugged into the formula. The base point is usually set to $\mathbf 0$, which makes formulas less cumbersome and empirically has little impact on the obtained results.
  • GRU architecture. The Euclidean GRU, with reset gate $r_t$ and update gate $z_t$, computes
    $$r_t=\sigma(W^r h_{t-1}+U^r x_t+b^r),\qquad z_t=\sigma(W^z h_{t-1}+U^z x_t+b^z),$$
    $$\tilde h_t=\varphi\big(W(r_t\odot h_{t-1})+Ux_t+b\big),\qquad h_t=(1-z_t)\odot h_{t-1}+z_t\odot\tilde h_t.$$
    First write the hyperbolic version of the gating operation $f(h,h'):=\sigma(h)\odot h'$:
    $$f^{\otimes_c}(h,h')=\exp_{\mathbf 0}^c\big(\sigma(\log_{\mathbf 0}^c(h))\odot\log_{\mathbf 0}^c(h')\big)=\exp_{\mathbf 0}^c\big(\operatorname{diag}(\sigma(\log_{\mathbf 0}^c(h)))\cdot\log_{\mathbf 0}^c(h')\big)=\operatorname{diag}(\sigma(\log_{\mathbf 0}^c(h)))\otimes_c h'.$$
    The reset gate $r_t$ and update gate $z_t$ can therefore be written as
    $$r_t=\sigma\log_{\mathbf 0}^c\big(W^r\otimes_c h_{t-1}\oplus_c U^r\otimes_c x_t\oplus_c b^r\big),\qquad z_t=\sigma\log_{\mathbf 0}^c\big(W^z\otimes_c h_{t-1}\oplus_c U^z\otimes_c x_t\oplus_c b^z\big),$$
    and the hidden-state update becomes
    $$\begin{aligned}\tilde h_t&=\varphi^{\otimes_c}\big(W\otimes_c(\operatorname{diag}(r_t)\otimes_c h_{t-1})\oplus_c U\otimes_c x_t\oplus_c b\big)=\varphi^{\otimes_c}\big((W\operatorname{diag}(r_t))\otimes_c h_{t-1}\oplus_c U\otimes_c x_t\oplus_c b\big),\\ h_t&=h_{t-1}\oplus_c\operatorname{diag}(z_t)\otimes_c\big((-h_{t-1})\oplus_c\tilde h_t\big).\end{aligned}$$
    (A sketch of one hyperbolic GRU step follows below.)
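A sketch of one hyperbolic GRU step following the equations above (all helpers — `mobius_add`, `mobius_matvec`, `expmap0`, `logmap0`, `mobius_fn_apply` — from the earlier sketches are assumed in scope; $\operatorname{diag}(r_t)\otimes_c h$ is implemented as a Möbius matrix–vector product with a diagonal matrix, and the non-associative $\oplus_c$ chains are evaluated left to right, which is my reading of the formulas):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hyperbolic_gru_step(h_prev, x_t, Wr, Ur, br, Wz, Uz, bz, W, U, b, c=1.0):
    """h_prev ∈ D_c^n, x_t ∈ D_c^d; W's are n×n, U's are n×d, biases ∈ D_c^n."""
    # Gates are ordinary Euclidean vectors in (0, 1): σ applied after log_0.
    r_t = sigmoid(logmap0(mobius_add(mobius_add(
        mobius_matvec(Wr, h_prev, c), mobius_matvec(Ur, x_t, c), c), br, c), c))
    z_t = sigmoid(logmap0(mobius_add(mobius_add(
        mobius_matvec(Wz, h_prev, c), mobius_matvec(Uz, x_t, c), c), bz, c), c))
    # Candidate state: φ^{⊗_c}( (W·diag(r_t)) ⊗_c h_{t-1} ⊕_c U ⊗_c x_t ⊕_c b )
    h_tilde = mobius_fn_apply(np.tanh, mobius_add(mobius_add(
        mobius_matvec(W @ np.diag(r_t), h_prev, c),
        mobius_matvec(U, x_t, c), c), b, c), c)
    # h_t = h_{t-1} ⊕_c diag(z_t) ⊗_c ( (-h_{t-1}) ⊕_c h̃_t )
    delta = mobius_matvec(np.diag(z_t), mobius_add(-h_prev, h_tilde, c), c)
    return mobius_add(h_prev, delta, c)

# Tiny usage example (hidden size n = 3, input size d = 2, small random weights).
rng = np.random.default_rng(0)
n, d = 3, 2
Wr, Wz, W = [0.1 * rng.normal(size=(n, n)) for _ in range(3)]
Ur, Uz, U = [0.1 * rng.normal(size=(n, d)) for _ in range(3)]
br, bz, b = [expmap0(0.1 * rng.normal(size=n)) for _ in range(3)]
h0, x1 = expmap0(0.1 * rng.normal(size=n)), expmap0(0.1 * rng.normal(size=d))
print(hyperbolic_gru_step(h0, x1, Wr, Ur, br, Wz, Uz, bz, W, U, b, c=1.0))
```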

Experiments

  • SNLI task and dataset. SNLI is a natural language inference / textual entailment dataset (decide whether a given premise entails a given hypothesis), with 570K training, 10K validation and 10K test sentence pairs.
  • PREFIX task and datasets. PREFIX is a synthetic dataset built by the authors to test hyperbolic models on data with an underlying tree structure. The task is detection of noisy prefixes: given a sentence pair, decide whether the second sentence is a noisy prefix of the first or a random sentence. In PREFIX-Z% (for Z being 10, 30 or 50), given a random prefix of a first sentence, a positive second sentence is generated by replacing Z% of the words of that prefix, and a negative one is a randomly generated sentence of the same length.
  • Model architecture. Hyperbolic layers can be stacked into an $n$-layer network just like Euclidean ones, and can also be combined with Euclidean components, but the hyperbolic parameters must then be trained with Riemannian optimization. The authors encode the two sentences with two separate RNN or GRU encoders; the resulting embeddings, together with the squared distance between them (hyperbolic or Euclidean, depending on their geometry), are fed into an FFNN (Euclidean or hyperbolic) and finally classified with an MLR (Euclidean or hyperbolic), trained with a cross-entropy loss.
  • Results. Euclidean models outperform hyperbolic ones on SNLI; the authors suggest this may be because adaptive optimizers such as Adam do not yet have hyperbolic counterparts. Hyperbolic models outperform Euclidean ones on tree-structured data: on the PREFIX datasets, as Z grows the data becomes less and less tree-like, and the gap between Euclidean and hyperbolic models shrinks accordingly. (See the paper for the full accuracy table on SNLI and PREFIX-10%/30%/50%.)
  • MLR classification experiments. On SNLI, hyperbolic MLR shows no clear advantage over Euclidean MLR; the authors suggest that with end-to-end training the learned embeddings may already be arranged so that a Euclidean MLR separates them well. To better isolate the benefit of hyperbolic MLR, they run an additional experiment: pick a subtree of WordNet and classify whether a node belongs to that subtree. Using word embeddings pre-trained on WordNet, they compare three binary classifiers: hyperbolic MLR, Euclidean MLR applied directly on the hyperbolic embeddings, and Euclidean MLR applied after mapping all embeddings to the tangent space at $\mathbf 0$ with the $\log_{\mathbf 0}$ map. (See the paper for the corresponding results table and for a visualization of 2-dimensional embeddings with the trained separation hyperplanes.)
