Paper notes - Kernel-based non-parametric activation functions for neural networks

Latest recommended article published on 2022-05-11 09:40:02

不甘心的程序员


abstract

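In a neural network, every linear layer is followed by a nonlinear function. This paper focuses on the design of that activation function. It introduces a new family of activation functions based on an inexpensive kernel expansion at every neuron, and proposes several variants for designing and initializing these kernel activation functions (KAFs). The results show that KAFs can approximate any mapping defined over a subset of the real line, whether convex or non-convex. KAFs are smooth over their entire domain, linear in their parameters, and can be regularized with any known scheme, including $L_1$ penalties.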

introduction

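In NNs, the activation function is usually an element-wise, (sub-)differentiable, and fixed nonlinearity. Practice has gradually moved from squashing functions (sigmoids) to piecewise-linear functions such as ReLUs, which back-propagate the error more effectively.
There are two issues on the design side. First, a known activation function can be parameterized with a small number of parameters, which yields a small gain in flexibility and performance.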

preliminaries

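For a feedforward neural network, the $l$-th layer is given by
[equation image missing], where $g_l$ is applied element-wise.
To train the network we are given a set of input/output pairs $S = \{X_i, Y_i\}_{i=1}^I$, and the loss function is
[equation image missing]

The choice of activation function depends on the task and on a proper scaling of the output, so it cannot be picked arbitrarily. For regression problems $g(s) = s$ is used, where $s$ denotes a single input. For binary classification with $y_i \in \{0, 1\}$, the sigmoid function is used:
$g(s) = \frac{1}{1+\exp\{-s\}}$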

Fixed activation functions

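The functions most commonly used in NNs are non-decreasing (squashing) functions, that is, functions satisfying
$\lim_{s \to -\infty} g(s) = c, \quad \lim_{s \to \infty} g(s) = 1$,
where $c$ is either 0 or $-1$, depending on convention.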
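Another common choice is the hyperbolic tangent:
$g(s) = \tanh(s) = \frac{\exp\{s\} - \exp\{-s\}}{\exp\{s\} + \exp\{-s\}}$
In practice, however, these functions suffer from vanishing and exploding gradients.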
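A breakthrough was the ReLU, $g(s) = \max\{0, s\}$. First, the gradient of the ReLU is either 0 or 1; second, the activations are sparse.
Softplus is a smoothed version of the ReLU: $g(s) = \log\{1 + \exp\{s\}\}$
To address the 'dying ReLU' problem, where a unit's activation gets stuck at 0 because of the initialization or a weight update, the leaky ReLU was introduced:
$g(s) = \begin{cases} s & \text{if } s \ge 0 \\ \alpha s & \text{otherwise} \end{cases}$
where $\alpha > 0$ is set to a small value such as 0.01.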
Another problem of activation functions having only non-negative output values is that their mean value is always positive by definition (why is this a problem??). Inspired by the natural gradient, the exponential linear unit (ELU) was proposed:
$g(s) = \begin{cases} s & \text{if } s \ge 0 \\ \alpha (\exp\{s\} - 1) & \text{otherwise} \end{cases}$
Scaled ELU (SELU): $g(s) = SELU(s) = \lambda \cdot ELU(s)$, where $\lambda > 1$.
Swish, inspired by the gating steps in a standard LSTM recurrent cell: $g(s) = s \cdot \delta(s)$, where $\delta(s)$ is the sigmoid.
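
The fixed activation functions listed in this section are easy to compare side by side. The following is a minimal sketch in plain Python, assuming scalar inputs and the usual textbook forms of the leaky ReLU and ELU; the default values `alpha=0.01` and `alpha=1.0` are illustrative choices, and SELU is left out because its constants are not given in these notes.

```python
import math

def relu(s):
    # g(s) = max{0, s}
    return max(0.0, s)

def softplus(s):
    # g(s) = log{1 + exp{s}}, a smooth version of the ReLU
    return math.log1p(math.exp(s))

def leaky_relu(s, alpha=0.01):
    # Small slope alpha on the negative side avoids a zero gradient
    return s if s >= 0 else alpha * s

def elu(s, alpha=1.0):
    # Saturates smoothly to -alpha for very negative inputs
    return s if s >= 0 else alpha * (math.exp(s) - 1.0)

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def swish(s):
    # g(s) = s * sigmoid(s)
    return s * sigmoid(s)

for s in (-2.0, 0.0, 2.0):
    print(s, relu(s), leaky_relu(s), elu(s), softplus(s), swish(s))
```

These scalar versions overflow for very large `|s|` because of `math.exp`; a practical implementation would work on arrays and guard against that.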

Parametric adaptable activation functions

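A quick way to make an NN more flexible is to parameterize the activation functions introduced above (activation functions with a fixed (small) number of adaptable parameters). As long as these functions are differentiable, their parameters can be optimized with a numerical optimization algorithm.
Because the number of parameters is fixed and the flexibility is limited, these are called parametric adaptable activation functions.
Generalized hyperbolic tangent: it has two positive scalars $a, b$, initialized randomly and independently; $a$ controls the range of the output and $b$ controls the slope of the curve.
$g(s) = \frac{a(1-\exp\{-bs\})}{1+\exp\{-bs\}}$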

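Parametric version of the leaky ReLU, with $\alpha$ initialized at 0.25 and adapted during training, called parametric ReLU (PReLU):
$g(s) = \begin{cases} s & \text{if } s \ge 0 \\ \alpha s & \text{otherwise} \end{cases}$
A modification of the ELU function in (9) with an additional scalar parameter, called parametric ELU (PELU):
[equation image missing], where $\alpha, \beta$ are initialized randomly and updated during training.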

A more flexible proposal is the S-shaped ReLU (SReLU), parameterized by four scalar values $\{t^r, a^r, t^l, a^l\}$:
[equation image missing]
A parametric version of the Swish function is the $\beta$-swish: $g(s) = s \cdot \delta(\beta s)$, with $\delta$ the sigmoid.
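
To make the role of the adaptable parameters concrete, here is a small sketch of the generalized hyperbolic tangent and the $\beta$-swish as plain Python functions. It assumes scalar inputs; in a real network $a$, $b$ and $\beta$ would be trainable, which this sketch does not cover.

```python
import math

def generalized_tanh(s, a, b):
    # g(s) = a(1 - exp{-bs}) / (1 + exp{-bs}), which equals a * tanh(b*s/2):
    # a sets the output range (-a, a), b sets the slope around zero
    e = math.exp(-b * s)
    return a * (1.0 - e) / (1.0 + e)

def beta_swish(s, beta):
    # g(s) = s * sigmoid(beta * s)
    return s / (1.0 + math.exp(-beta * s))

print(generalized_tanh(1.0, a=2.0, b=3.0))
print(beta_swish(2.0, beta=0.0))   # beta = 0 gives the linear function s/2
print(beta_swish(2.0, beta=50.0))  # large beta approaches the ReLU
```

The two limiting cases of `beta_swish` show why a single scalar already changes the shape noticeably: the function moves between a scaled identity and a ReLU-like curve.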

Non-parametric activation functions

Parametric activation functions have limited flexibility. Non-parametric activation functions can model a larger class of shapes, mainly by introducing additional hyper-parameters.

APL functions (piecewise-linear interpolation functions)

The APL function generalizes the SReLU function in (15) by summing multiple linear segments:
$g(s) = \max\{0, s\} + \sum_{i=1}^{S} a_i \max\{0, -s+b_i\}$
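
The APL formula can be evaluated directly. The sketch below assumes scalar input and takes the slopes $a_i$ and hinge locations $b_i$ as plain lists; the numeric values in the usage lines are made up for illustration.

```python
def apl(s, a, b):
    # g(s) = max{0, s} + sum_i a_i * max{0, -s + b_i}
    out = max(0.0, s)
    for a_i, b_i in zip(a, b):
        out += a_i * max(0.0, -s + b_i)
    return out

# With no extra segments the APL function reduces to the ReLU
print(apl(1.5, [], []))
# Two hinges (hypothetical values) add piecewise-linear corrections
print(apl(2.0, [0.5, -0.2], [1.0, 3.0]))
```

Each hinge term is active only for $s < b_i$, so the number of segments $S$ is the hyper-parameter that controls how many linear pieces the function can have.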

spline functions: spline interpolation (SAF)

Polynomial interpolation (polynomial activation function, PAF): $g(s) = \sum_{i=0}^P a_i s^i$, where $P$ is a hyper-parameter and there are $(P+1)$ coefficients $\{a_i\}_{i=0}^P$.
Drawback: every $a_i$ has a global effect on the function.
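
The global effect of each coefficient is easy to see numerically. This is a minimal sketch of a polynomial activation evaluated with Horner's rule; the coefficient values are arbitrary examples.

```python
def paf(s, coeffs):
    # coeffs = [a_0, a_1, ..., a_P]; Horner's rule evaluates sum_i a_i * s**i
    out = 0.0
    for a_i in reversed(coeffs):
        out = out * s + a_i
    return out

base = [1.0, 2.0, 3.0]
bumped = [1.0, 2.0, 3.5]  # change only a_2
for s in (-2.0, 0.5, 2.0):
    # The difference is 0.5 * s**2, nonzero at every s != 0
    print(s, paf(s, base), paf(s, bumped))
```

Changing a single $a_i$ moves the output at every nonzero input, which is the drawback noted above and the motivation for local representations such as splines.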

maxout networks

$g(h) = \max_{i=1,2,\ldots,K} \{w_i^T h + b_i\}$

To address the smoothness problem, two variants were introduced.
soft-maxout:

$g(h) = \log\{\sum_{i=1}^K \exp\{w_i^T h + b_i\}\}$

$l_p$-maxout:

$g(h) = \sqrt[p]{\sum_{i=1}^K |w_i^T h + b_i|^p}$

Related to the $l_p$-maxout is the $L_p$ unit:

$g(h) = (\frac{1}{K} \sum_{i=1}^K |s_i - c_i|^p)^{\frac{1}{p}}$, where $s_i = w_i^T h + b_i$
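
The maxout family differs only in how the $K$ affine projections $s_i = w_i^T h + b_i$ are combined. The following sketch assumes `h` is a list of floats, `W` a list of $K$ weight vectors and `b` a list of $K$ biases; all numeric values are illustrative.

```python
import math

def affine(h, W, b):
    # s_i = w_i^T h + b_i for each of the K units
    return [sum(w * x for w, x in zip(w_i, h)) + b_i for w_i, b_i in zip(W, b)]

def maxout(h, W, b):
    return max(affine(h, W, b))

def soft_maxout(h, W, b):
    s = affine(h, W, b)
    m = max(s)  # subtract the maximum before exponentiating for stability
    return m + math.log(sum(math.exp(s_i - m) for s_i in s))

def lp_maxout(h, W, b, p):
    return sum(abs(s_i) ** p for s_i in affine(h, W, b)) ** (1.0 / p)

def lp_unit(h, W, b, c, p):
    s = affine(h, W, b)
    return (sum(abs(s_i - c_i) ** p for s_i, c_i in zip(s, c)) / len(s)) ** (1.0 / p)

h = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
print(maxout(h, W, b), soft_maxout(h, W, b), lp_maxout(h, W, b, p=2))
```

The soft-maxout is a smooth upper bound on the maxout output, so it keeps the same general behavior while removing the kinks.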

Proposed kernel-based activation functions

$g(s) = \sum_{i=1}^D \alpha_i k(s, d_i)$
where the $\alpha_i$ are the mixing coefficients and the $d_i$ are the dictionary elements.

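For our experiments, we use the 1D Gaussian kernel defined as

$k(s, d_i) = \exp\{-\gamma (s-d_i)^2\}$

where $\gamma$ is called the kernel bandwidth.
This gives simple derivatives for back-propagation:
$\frac{\partial g(s)}{\partial \alpha_i} = k(s, d_i), \quad \frac{\partial g(s)}{\partial s} = \sum_{i=1}^D \alpha_i \frac{\partial k(s, d_i)}{\partial s} = -2\gamma \sum_{i=1}^D \alpha_i (s - d_i) k(s, d_i)$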
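
A KAF with a Gaussian kernel and its two derivatives fit in a few lines. This is a scalar sketch, not the authors' implementation; the dictionary, coefficients and bandwidth in the usage lines are hypothetical values, and the last lines compare the analytic derivative with a finite difference as a sanity check.

```python
import math

def gaussian_kernel(s, d_i, gamma):
    # k(s, d_i) = exp{-gamma (s - d_i)^2}
    return math.exp(-gamma * (s - d_i) ** 2)

def kaf(s, alpha, d, gamma):
    # g(s) = sum_i alpha_i k(s, d_i)
    return sum(a_i * gaussian_kernel(s, d_i, gamma) for a_i, d_i in zip(alpha, d))

def kaf_grad_alpha(s, d, gamma):
    # dg/dalpha_i = k(s, d_i): the KAF is linear in its parameters
    return [gaussian_kernel(s, d_i, gamma) for d_i in d]

def kaf_grad_s(s, alpha, d, gamma):
    # dg/ds = -2 gamma sum_i alpha_i (s - d_i) k(s, d_i)
    return -2.0 * gamma * sum(
        a_i * (s - d_i) * gaussian_kernel(s, d_i, gamma) for a_i, d_i in zip(alpha, d)
    )

d = [-1.0, 0.0, 1.0]       # hypothetical dictionary
alpha = [0.5, -1.0, 2.0]   # hypothetical mixing coefficients
gamma, s, h = 1.0, 0.3, 1e-6
numeric = (kaf(s + h, alpha, d, gamma) - kaf(s - h, alpha, d, gamma)) / (2 * h)
print(kaf_grad_s(s, alpha, d, gamma), numeric)
```

Because the dictionary is fixed, only `alpha` is trained, and its gradient is just the vector of kernel values.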

On the selection of the kernel bandwidth $\gamma$

Selecting $\gamma$ is crucial for the well-behavedness of the method.
Many methods have been proposed to select the bandwidth parameter for kernel density estimation, but in that problem the abscissa corresponds to a given dataset with an arbitrary distribution, whereas in a KAF the abscissa is chosen on a grid.
Rather than treating $\gamma$ as a hyper-parameter, the paper sets it empirically:
$\gamma = \frac{1}{6\Delta^2}$
where $\Delta$ is the distance between the grid points.
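
The rule of thumb ties the bandwidth to the grid spacing, so both can be computed together. The sketch below assumes a uniform grid of `D` points on `[-limit, limit]`; the helper names and the values `D=5`, `limit=2.0` are illustrative choices, not taken from the paper.

```python
def make_dictionary(D, limit):
    # D equally spaced points on [-limit, limit]; requires D >= 2
    step = 2.0 * limit / (D - 1)
    return [-limit + i * step for i in range(D)], step

def bandwidth(step):
    # gamma = 1 / (6 * Delta^2), with Delta the grid spacing
    return 1.0 / (6.0 * step ** 2)

d, step = make_dictionary(D=5, limit=2.0)
print(d, step, bandwidth(step))
```

A finer grid gives a larger $\gamma$, that is, narrower kernels, so neighboring kernels keep roughly the same overlap regardless of the dictionary size.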
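Different initializations of a KAF lead to different performance, so the parameters can be initialized to give the KAF a desired shape, as in the figure:
[figure not preserved]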

On the initialization of the mixing coefficients

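Kernel ridge regression is used to initialize the mixing coefficients:
$\alpha = (K + \varepsilon I)^{-1} t$
where $t = [t_1, t_2, \ldots, t_D]^T$ holds the desired initial values of the KAF at the corresponding dictionary elements $d = [d_1, d_2, \ldots, d_D]^T$, and $K$ is the kernel matrix computed from $t$ and $d$.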
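
The initialization can be tried out without any numerical library. The sketch below makes several assumptions that are not stated in these notes: the kernel matrix is taken as $K_{ij} = k(d_i, d_j)$ over the dictionary, the target shape is $\tanh$, the regularizer is $\varepsilon = 10^{-8}$, and the linear system is solved with a small hand-written Gaussian elimination (`solve` and `init_alpha` are hypothetical helper names).

```python
import math

def gaussian_kernel(s, d_i, gamma):
    return math.exp(-gamma * (s - d_i) ** 2)

def solve(A, y):
    # Gaussian elimination with partial pivoting on the augmented matrix [A | y]
    n = len(y)
    M = [row[:] + [y_i] for row, y_i in zip(A, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def init_alpha(d, t, gamma, eps=1e-8):
    # alpha = (K + eps I)^{-1} t, assuming K_ij = k(d_i, d_j)
    D = len(d)
    K = [[gaussian_kernel(d[i], d[j], gamma) + (eps if i == j else 0.0)
          for j in range(D)] for i in range(D)]
    return solve(K, t)

D, limit = 9, 2.0
step = 2.0 * limit / (D - 1)
d = [-limit + i * step for i in range(D)]
gamma = 1.0 / (6.0 * step ** 2)
t = [math.tanh(d_i) for d_i in d]  # desired initial shape (illustrative)
alpha = init_alpha(d, t, gamma)
fit = [sum(a * gaussian_kernel(d_j, d_i, gamma) for a, d_i in zip(alpha, d)) for d_j in d]
print(max(abs(f - t_j) for f, t_j in zip(fit, t)))
```

The printed value is the largest gap between the initialized KAF and the target at the dictionary points. A larger $\varepsilon$ trades some of that accuracy for smaller, better-conditioned coefficients.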

Multi-dimensional kernel activation functions

A 2D-KAF acts on a pair of activations rather than on a single one, so it learns a two-dimensional function.
For each possible pair of activations $s = [s_k, s_{k+1}]^T$ we obtain:

$g(s) = \sum_{i=1}^{D^2} \alpha_i k(s, d_i)$

where $d_i$ is the $i$-th element of the dictionary.
For the 2D Gaussian kernel:

$k(s, d_i) = \exp\{-\gamma \Vert s-d_i \Vert_2^2\}$
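
The two-dimensional case only changes the dictionary and the distance inside the kernel. This sketch assumes the $D^2$ dictionary elements are the Cartesian product of a 1D grid with itself, which is one natural reading of the formula; the grid, coefficients and `gamma=0.5` are illustrative.

```python
import math
from itertools import product

def kaf2d(s, alpha, dictionary, gamma):
    # g(s) = sum_i alpha_i exp{-gamma ||s - d_i||_2^2} for a pair s = (s_k, s_k+1)
    out = 0.0
    for a_i, d_i in zip(alpha, dictionary):
        sq_dist = (s[0] - d_i[0]) ** 2 + (s[1] - d_i[1]) ** 2
        out += a_i * math.exp(-gamma * sq_dist)
    return out

grid = [-1.0, 0.0, 1.0]
dictionary = list(product(grid, repeat=2))  # D^2 = 9 two-dimensional points
alpha = [0.0] * len(dictionary)
alpha[dictionary.index((0.0, 0.0))] = 1.0   # a single bump centered at the origin
print(kaf2d((0.0, 0.0), alpha, dictionary, gamma=0.5))
print(kaf2d((1.0, 0.0), alpha, dictionary, gamma=0.5))
```

The number of coefficients grows as $D^2$, so the per-dimension dictionary size has to stay small for the 2D variant to remain cheap.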

KAFs are smooth over their entire domain, and their operations can be implemented easily with a high degree of vectorization.
