Concepts Related to Deep Learning

  • Epoch

One epoch is when an ENTIRE dataset is passed forward and backward through the neural network only ONCE [1].

  • Iteration

The number of iterations is the number of batches needed to complete one epoch [1].
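As a quick worked example (a minimal sketch; the dataset size and batch size below are made-up numbers), the relationship between epochs, iterations and batches is just integer arithmetic:

```python
import math

num_examples = 10_000   # hypothetical dataset size
batch_size = 32         # hypothetical mini-batch size

# One iteration processes one batch; one epoch processes the whole dataset once,
# so the number of iterations per epoch is the number of batches in the dataset.
iterations_per_epoch = math.ceil(num_examples / batch_size)
print(iterations_per_epoch)  # 313 (the last batch holds only 16 examples)
```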

Batch normalization [3]

Use $B$ to denote a mini-batch of size $m$ of the entire training set. The empirical mean and variance of $B$ could thus be denoted as

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \quad \text{and} \quad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_B\right)^2.$$

For a layer of the network with $d$-dimensional input, $x = (x^{(1)}, \ldots, x^{(d)})$, each dimension of its input is then normalized separately:

$$\hat{x}_i^{(k)} = \frac{x_i^{(k)} - \mu_B^{(k)}}{\sqrt{\left(\sigma_B^{(k)}\right)^2 + \epsilon}}.$$

To restore the representation power of the network, a transformation step then follows:

$$y_i^{(k)} = \gamma^{(k)} \hat{x}_i^{(k)} + \beta^{(k)},$$

where the parameters $\gamma^{(k)}$ and $\beta^{(k)}$ are subsequently learned in the optimization process.

Formally, the operation that implements batch normalization is a transform $BN_{\gamma^{(k)},\beta^{(k)}}: x_{1\ldots m}^{(k)} \rightarrow y_{1\ldots m}^{(k)}$ called the Batch Normalizing transform. The output of the BN transform, $y^{(k)} = BN_{\gamma^{(k)},\beta^{(k)}}(x^{(k)})$, is then passed to other network layers, while the normalized output $\hat{x}_i^{(k)}$ remains internal to the current layer.
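A minimal NumPy sketch of the training-time forward pass described above (per-dimension batch statistics; the running statistics used at inference time and the backward pass are omitted):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (m, d) mini-batch; gamma, beta: (d,) learned scale and shift."""
    mu = x.mean(axis=0)                    # mu_B, one value per dimension k
    var = x.var(axis=0)                    # sigma_B^2, one value per dimension k
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta            # y_i^(k) = gamma^(k) * x_hat_i^(k) + beta^(k)

x = np.random.randn(64, 128)               # m = 64, d = 128
y = batch_norm_forward(x, gamma=np.ones(128), beta=np.zeros(128))
```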

Conditional batch normalization [4]

CBN instead learns to output new BN parameters $\hat{\gamma}_{i,c}$ and $\hat{\beta}_{i,c}$ as a function of some input $\boldsymbol{x}_i$:

$$\gamma_{i,c} = f_c(\boldsymbol{x}_i), \quad \beta_{i,c} = h_c(\boldsymbol{x}_i),$$

where $f$ and $h$ are arbitrary functions such as neural networks. Thus, $f$ and $h$ can learn to control the distribution of CNN activations based on $\boldsymbol{x}_i$.

Combined with ReLU non-linearities, CBN empowers a conditioning model to manipulate feature maps of a target CNN by scaling them up or down, negating them, shutting them off, selectively thresholding them, and more. Each feature map is modulated independently, giving the conditioning model an exponential (in the number of feature maps) number of ways to affect the feature representation.

Rather than output $\hat{\gamma}_{i,c}$ directly, [4] output $\Delta\hat{\gamma}_{i,c}$, where

$$\hat{\gamma}_{i,c} = 1 + \Delta\hat{\gamma}_{i,c},$$

since an initially zero-centred $\hat{\gamma}_{i,c}$ can zero out CNN feature map activations and thus gradients.
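A minimal PyTorch-style sketch of this scheme (the single linear predictor, layer sizes, and class name below are made-up for illustration; [4] predict the modulation parameters from a question embedding and apply them to the normalized feature maps of a ResNet):

```python
import torch
import torch.nn as nn

class ConditionalModulation(nn.Module):
    """Predicts per-feature-map (delta_gamma, beta) from a conditioning vector
    and modulates a feature map, using the gamma = 1 + delta_gamma trick."""
    def __init__(self, cond_dim, num_channels):
        super().__init__()
        self.to_params = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, feature_map, cond):
        # feature_map: (N, C, H, W), cond: (N, cond_dim)
        delta_gamma, beta = self.to_params(cond).chunk(2, dim=1)
        gamma = 1.0 + delta_gamma  # zero-centred predictions no longer zero out activations
        return gamma[:, :, None, None] * feature_map + beta[:, :, None, None]

cbn = ConditionalModulation(cond_dim=256, num_channels=64)
out = cbn(torch.randn(8, 64, 32, 32), torch.randn(8, 256))
```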

Conditional normalization [7,8]
  • Conditional normalization aims to modulate the activation via a learned affine transformation conditioned on external data (e.g., an image of an artwork for capturing
    a specific style).
  • Conditional normalization methods include Conditional Batch Normalization for general visual question answering on complex scenes such as VQA and GuessWhat (Dumoulin et al., 2017), Conditional Instance Normalization and Adaptive Instance Normalization (Huang & Belongie, 2017) for image stylization, Dynamic Layer Norm for speech recognition, and SPADE (Park et al., 2019); a minimal Adaptive Instance Normalization sketch follows this list.
  • Conditional normalization methods are widely used in the style transfer and image synthesis tasks, and also applied to align different data distributions for domain adaptation.
  • FiLM [8] can be viewed as a generalization of CN methods.
    CN - “replace the parameters of the feature-wise affine transformation typical in normalization layers”
    FiLM - “not strictly necessary for the affine transformation to occur directly after normalization”
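As one concrete example from this family, a minimal sketch of Adaptive Instance Normalization (Huang & Belongie, 2017), where the affine parameters come from the style features' per-channel statistics rather than from learned weights:

```python
import torch

def adain(content, style, eps=1e-5):
    """content, style: (N, C, H, W). Normalize the content features with their own
    per-sample, per-channel statistics, then rescale with the style statistics."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean

stylized = adain(torch.randn(4, 64, 32, 32), torch.randn(4, 64, 32, 32))
```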
Remarks on the effectiveness of convolutional networks
  • Local connectivity can greatly reduce the number of parameters in the model, which inherently provides some form of built-in regularization.
  • The convolution operation has a direct filtering interpretation: each filter is convolved with the input features to identify patterns as groups of pixels. Thus, the outputs of each convolutional layer correspond to important spatial features in the original input space and offer some robustness to simple transformations.
Translational invariance vs Translational equivariance [5,6]
  • Translation invariance means that the system produces exactly the same response, regardless of how its input is shifted. Equivariance means that the system works equally well across positions, but its response shifts with the position of the target. For example, a heat map of “face-iness” would have similar bumps at different positions.
  • Convolution provides translational equivariance (rather than translation invariance), meaning if an object in an image is at area A and through convolution a feature is detected at the output at area B, then the same feature would be detected when the object in the image is translated to A’. The position of the output feature would also be translated to a new area B’ based on the filter kernel size.
  • Translational invariance is a result of the pooling operation (not the convolution operation); a small numerical sketch of both properties follows this list.
  • Additional info (see [5]): 1. The convolution operator commutes with respect to translation. 2. One approach to translation-invariant object recognition is via template-matching.
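A small numerical sketch of both properties (a toy 8×8 input with a single bright pixel, an all-ones 3×3 filter, and global max pooling standing in for the pooling stage):

```python
import torch
import torch.nn.functional as F

# A single bright pixel, and the same pixel shifted 4 columns to the right.
img = torch.zeros(1, 1, 8, 8)
img[0, 0, 2, 2] = 1.0
shifted = torch.roll(img, shifts=4, dims=3)

kernel = torch.ones(1, 1, 3, 3)  # a toy "feature detector"

# Equivariance: the convolution response shifts together with the input.
resp = F.conv2d(img, kernel, padding=1)
resp_shifted = F.conv2d(shifted, kernel, padding=1)
print(torch.equal(torch.roll(resp, shifts=4, dims=3), resp_shifted))  # True

# Invariance: global max pooling discards the position entirely.
print(torch.equal(resp.amax(), resp_shifted.amax()))  # True
```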
Optimisation over non-differentiable $k$NN operations

In DNNs, there is a common trick for computing the gradient of operations that are non-differentiable at some points but differentiable elsewhere, such as max-pooling (top-1) and top-k. In the forward pass, the index positions of the max (or top-k) values are stored. In the backpropagation pass, the gradient is computed only with respect to these saved positions. This trick is implemented in modern deep learning frameworks such as TensorFlow (tf.nn.top_k()) and PyTorch.
—————— https://openreview.net/forum?id=SyVuRiC5K7 (Response to AnonReviewer 2)
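A quick PyTorch check of this behaviour: torch.topk records the winning indices in the forward pass, and the backward pass routes gradient only to those positions.

```python
import torch

x = torch.tensor([0.3, 2.0, -1.0, 5.0, 0.7], requires_grad=True)

values, indices = torch.topk(x, k=2)  # forward: keep the top-2 values and their indices
values.sum().backward()               # backward: gradient flows only to those indices

print(indices)  # tensor([3, 1])
print(x.grad)   # tensor([0., 1., 0., 1., 0.])
```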

References

  1. https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9
  2. Siamese Neural Networks for One-shot Image Recognition
  3. https://en.wikipedia.org/wiki/Batch_normalization
  4. Perez, Ethan, et al. “Learning visual reasoning without strong priors.” arXiv preprint arXiv:1707.03017 (2017).
  5. https://stats.stackexchange.com/questions/208936/what-is-translation-invariance-in-computer-vision-and-convolutional-neural-netwo
  6. Translational Invariance Vs Translational Equivariance. https://towardsdatascience.com/translational-invariance-vs-translational-equivariance-f9fbc8fca63a
  7. Tseng, Hung-Yu, et al. “Cross-Domain Few-Shot Classification via Learned Feature-Wise Transformation.” ICLR. 2020.
  8. Perez, Ethan, et al. “FiLM: Visual Reasoning with a General Conditioning Layer.” AAAI. 2018.