Issue 1: How to understand the scale factor $\alpha$
In your paper, you propose a scale factor $\alpha$ that replaces the batch-calculated scaling parameters of the original Batch Normalization. I have two questions about the use of $\alpha$.

In your paper, $\alpha$ is calculated by the functions below:
$$\alpha = \max(\mathrm{Shift}(L_{min}/L),\ 1) \tag{1}$$
$$\mathrm{Shift}(x) = 2^{\mathrm{round}(\log_{2} x)} \tag{2}$$
$$L_{min} = \beta \sigma \tag{3}$$
where $\beta > 1$ and $\sigma(k) = 2^{1-k},\ k \in \mathbb{N}_{+}$
$$L = \max\!\left(\sqrt{6/n_{in}},\ L_{min}\right) \tag{4}$$
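To make my reading of Eqs. (1)–(4) concrete, here is a minimal Python sketch of how I currently understand the computation of $\alpha$; the function names (`shift`, `compute_alpha`) and the default $\beta = 2$ are my own assumptions, not taken from the paper:

```python
import math

def shift(x):
    """Eq. (2): Shift(x) = 2^round(log2(x))."""
    return 2.0 ** round(math.log2(x))

def compute_alpha(n_in, k, beta=2.0):
    """My reading of Eqs. (1), (3), (4); beta > 1 and the bit-width k are hyperparameters."""
    l_min = beta * 2.0 ** (1 - k)           # Eq. (3): L_min = beta * sigma(k), sigma(k) = 2^(1-k)
    l = max(math.sqrt(6.0 / n_in), l_min)   # Eq. (4): L = max(sqrt(6 / n_in), L_min)
    return max(shift(l_min / l), 1.0)       # Eq. (1): alpha = max(Shift(L_min / L), 1)
```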
Since $L \geq L_{min}$ by (4), we can get that

$$L_{min}/L \leq 1 \tag{5}$$
and hence $\mathrm{round}(\log_{2}(L_{min}/L)) \leq 0$, which gives

$$\mathrm{Shift}(L_{min}/L) \leq 1 \tag{6}$$
so

$$\alpha = \max(\mathrm{Shift}(L_{min}/L),\ 1) \equiv 1 \tag{7}$$
Obviously, $\alpha$ should not always equal 1; otherwise the quantized activation

$$a_{q} = Q_{A}(a) = Q(a/\alpha,\ k_{A}) \tag{8}$$

would never actually be scaled.
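As a quick numerical check of the chain (5)–(7), sweeping the sketch above over a few illustrative fan-ins and bit-widths (values chosen by me, not from the paper) always yields $\alpha = 1$:

```python
# Illustrative sweep; every combination gives alpha == 1 under the reading above.
for n_in in (64, 256, 1024, 4096):
    for k in (2, 4, 8):
        assert compute_alpha(n_in, k) == 1.0
print("alpha == 1 for every case tried")
```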
According to (3), (4), and (7), $\alpha$ does not depend on the current batch of data, so it is not straightforward to understand why $\alpha$ can take the place of the variance, which is highly dependent on the current batch.
Issue 2: Why can the mean of the activations be hypothesized to be 0
It’s written in the paper that:
Besides, we hypothesize that batch outputs of each hidden layer approximately have zero-mean, then …
But it seems that there is no further explanation of this hypothesis.
Issue 3: How to shift the curve
Clearly, Shift(·) can change the mean of the blue curve, but it is not straightforward why the red curve keeps exactly the same shape as the blue curve.