Comparing chuanqi305's train model with his deploy model reveals something interesting: the BN layers and Scale layers are gone from the deploy model! Can BN layers really be thrown away that casually? That makes no sense!

After some digging through the literature, it turns out that a BN layer can be folded into the preceding convolutional (or fully connected) layer, and doing so even reduces inference time. See "Real-time object detection with YOLO - Converting to Metal":


The derivation is simple. Let $x$ be the input image and $w$ the convolution weights. The convolution computes

$$conv[j] = x[i]w[0] + x[i+1]w[1] + x[i+2]w[2] + \dots + x[i+k]w[k] + b$$

The BN layer then computes

$$bn[j] = \frac{\gamma (conv[j] - mean)}{\sqrt{variance}} + \beta = \frac{\gamma \cdot conv[j]}{\sqrt{variance}} - \frac{\gamma \cdot mean}{\sqrt{variance}} + \beta$$

Substituting $conv[j]$ gives

$$bn[j] = x[i] \frac{\gamma \cdot w[0]}{\sqrt{variance}} + x[i+1] \frac{\gamma \cdot w[1]}{\sqrt{variance}} + \dots + x[i+k] \frac{\gamma \cdot w[k]}{\sqrt{variance}} + \frac{\gamma \cdot b}{\sqrt{variance}} - \frac{\gamma \cdot mean}{\sqrt{variance}} + \beta$$

Comparing the two expressions term by term yields

$$w_{new} = \frac{\gamma \cdot w}{\sqrt{variance}}, \qquad b_{new} = \beta + \frac{\gamma \cdot b}{\sqrt{variance}} - \frac{\gamma \cdot mean}{\sqrt{variance}} = \beta + \frac{\gamma (b - mean)}{\sqrt{variance}}$$

Note that $\gamma$, $mean$, $variance$, and $\beta$ are all learned during training, so at inference time they are just constants.
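As a sketch, this folding can be written in a few lines of numpy (the helper name `fold_bn` and the weight layout are assumptions for illustration; real BN implementations also add a small `eps` under the square root for numerical stability):

```python
import numpy as np

def fold_bn(w, b, gamma, beta, mean, variance, eps=1e-5):
    """Fold learned BN parameters into conv weights and bias.

    w: conv weights, shape (out_channels, kh, kw, in_channels)
    b, gamma, beta, mean, variance: per output channel, shape (out_channels,)
    """
    scale = gamma / np.sqrt(variance + eps)   # gamma / sqrt(variance)
    w_new = w * scale.reshape(-1, 1, 1, 1)    # w_new = gamma * w / sqrt(variance)
    b_new = beta + scale * (b - mean)         # b_new = beta + gamma * (b - mean) / sqrt(variance)
    return w_new, b_new
```

The conv + BN pair then collapses into a single convolution that uses `w_new` and `b_new`.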



However, there was a small wrinkle… YOLO uses a regularization technique called batch normalization after its convolutional layers.

The idea behind “batch norm” is that neural network layers work best when the data is clean. Ideally, the input to a layer has an average value of 0 and not too much variance. This should sound familiar to anyone who’s done any machine learning because we often use a technique called “feature scaling” or “whitening” on our input data to achieve this.

Batch normalization does a similar kind of feature scaling for the data in between layers. This technique really helps neural networks perform better because it stops the data from deteriorating as it flows through the network.

To give you some idea of the effect of batch norm, here is a histogram of the output of the first convolution layer without and with batch normalization:

[Figure: histogram of the first layer's output with and without batch norm]

Batch normalization is important when training a deep network, but it turns out we can get rid of it at inference time. Which is a good thing because not having to do the batch norm calculations will make our app faster. And in any case, Metal does not have an MPSCNNBatchNormalization layer.

Batch normalization usually happens after the convolutional layer but before the activation function gets applied (a so-called “leaky” ReLU in the case of YOLO). Since both convolution and batch norm perform a linear transformation of the data, we can combine the batch normalization layer’s parameters with the weights for the convolution. This is called “folding” the batch norm layer into the convolution layer.

Long story short, with a bit of math we can get rid of the batch normalization layers but it does mean we have to change the weights of the preceding convolution layer.

A quick recap of what a convolution layer calculates: if x is the pixels in the input image and w is the weights for the layer, then the convolution basically computes the following for each output pixel:

out[j] = x[i]*w[0] + x[i+1]*w[1] + x[i+2]*w[2] + ... + x[i+k]*w[k] + b

This is a dot product of the input pixels with the weights of the convolution kernel, plus a bias value b.
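As a toy illustration (hypothetical sizes and values), one output pixel of a 1-D convolution is exactly this windowed dot product:

```python
import numpy as np

def conv1d_pixel(x, w, b, i):
    """One output pixel: dot product of a k-wide window of x with w, plus bias b."""
    k = len(w)
    return float(np.dot(x[i:i + k], w) + b)

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, -1.0, 2.0])
print(conv1d_pixel(x, w, b=0.1, i=0))  # 0.5*1 - 1.0*2 + 2.0*3 + 0.1 = 4.6
```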

And here’s the calculation performed by the batch normalization to the output of that convolution:

        gamma * (out[j] - mean)
bn[j] = ----------------------- + beta
             sqrt(variance)

It subtracts the mean from the output pixel, divides by the square root of the variance, multiplies by a scaling factor gamma, and adds the offset beta. These four parameters (mean, variance, gamma, and beta) are what the batch normalization layer learns as the network is trained.
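Expressed directly as code (a sketch; real implementations also add a tiny `eps` to the variance before taking the square root):

```python
import numpy as np

def batchnorm(out, gamma, beta, mean, variance, eps=1e-5):
    """Apply learned batch-norm parameters to a convolution output."""
    return gamma * (out - mean) / np.sqrt(variance + eps) + beta
```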

To get rid of the batch normalization, we can shuffle these two equations around a bit to compute new weights and bias terms for the convolution layer:

           gamma * w
w_new = --------------
        sqrt(variance)

        gamma*(b - mean)
b_new = ---------------- + beta
         sqrt(variance)

Performing a convolution with these new weights and bias terms on input x will give the same result as the original convolution plus batch normalization.

Now we can remove this batch normalization layer and just use the convolutional layer, but with these adjusted weights and bias terms w_new and b_new. We repeat this procedure for all the convolutional layers in the network.
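A quick numerical check of that equivalence on a random 1-D signal (a sketch using numpy's `convolve`; the same idea applies per output channel of a 2-D layer):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
w = rng.standard_normal(3)
b = 0.2
gamma, beta, mean, var = 1.5, -0.3, 0.4, 2.0

# Original pipeline: convolution (correlation), then batch norm.
conv = np.convolve(x, w[::-1], mode="valid") + b
bn = gamma * (conv - mean) / np.sqrt(var) + beta

# Folded pipeline: one convolution with the adjusted weights and bias.
scale = gamma / np.sqrt(var)
folded = np.convolve(x, (w * scale)[::-1], mode="valid") + (beta + scale * (b - mean))

print(np.allclose(bn, folded))  # True
```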

Note: The convolution layers in YOLO don’t actually use bias, so b is zero in the above equation. But note that after folding the batch norm parameters, the convolution layers do get a bias term.

Once we’ve folded all the batch norm layers into their preceding convolution layers, we can convert the weights to Metal. This is a simple matter of transposing the arrays (Keras stores them in a different order than Metal) and writing them out to binary files of 32-bit floating point numbers.

If you’re curious, check out the conversion script for more details. To test that the folding works, the script creates a new model without batch norm but with the adjusted weights, and compares its predictions to those of the original model.

