每个batch中的元素单位大小相同,有点像归一化 优点: because of less covariate shift, learning rate可以设大一点less vanishing gradient problemsless sensitive to initialization