PyTorch: trained weights give slower CPU inference than randomly initialized weights

Environment 1: Python 2.7, torch 0.2

Environment 2: Python 3.6, torch 1.0

In both environments, inference on the CPU is slower with trained weights than with randomly initialized weights.

Inference time of my model with randomly initialized weights: about 2.9 s.

Inference time of the same model with trained weights: a bit over 6 s, accompanied by low CPU utilization.
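For reference, the comparison was measured roughly as sketched below. This is only a sketch written against PyTorch 1.x; MyNet and trained.pth are placeholders for my own network and checkpoint, and the input shape matches the dump further down.

    import time
    import torch

    torch.set_num_threads(4)              # adjust to your CPU

    def bench(model, x, n=10):
        model.eval()
        with torch.no_grad():             # PyTorch 1.x; torch 0.2 would use volatile Variables instead
            model(x)                      # warm-up
            t0 = time.time()
            for _ in range(n):
                model(x)
        return (time.time() - t0) / n

    x = torch.randn(1, 1024, 32, 32)      # same input shape as in the dump below

    model_random = MyNet()                # MyNet / trained.pth are placeholders for my own model
    model_trained = MyNet()
    model_trained.load_state_dict(torch.load("trained.pth", map_location="cpu"))

    print("random weights :", bench(model_random, x), "s/iter")
    print("trained weights:", bench(model_trained, x), "s/iter")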

I searched for a long time without finding the cause. Today I came across an explanation, so I'm noting it here.

Trained weight

        shape: torch.Size([512, 1024, 3, 3])

        dtype: torch.float32

        any nans?: 0

        min: -0.07247687131166458

        max: 0.0953093096613884

        mean: -4.6253231289483665e-07

        std: 0.0008879730594344437

Good (random) weight

        shape: torch.Size([512, 1024, 3, 3])

        dtype: torch.float32

        any nans?: 0

        min: -0.0048497761599719524

        max: 0.004831577185541391

        mean: -1.2575401342473924e-07

        std: 0.001000691088847816

Input vector

        shape: torch.Size([1, 1024, 32, 32])

        dtype: torch.float32

        any nans?: 0

        min: -5.017491817474365

        max: 4.822595119476318

        mean: -0.0013921058271080256

        std: 1.0015257596969604

The trained weights are denormal!
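A quick way to confirm this is to print the same statistics plus a count of subnormal entries. The helper below is my own sketch; it relies on torch.finfo, so it assumes a reasonably recent PyTorch.

    import torch

    def weight_report(name, t):
        # The smallest positive *normal* float32 is ~1.1755e-38; any non-zero
        # value with a smaller magnitude is a denormal (subnormal) number.
        tiny = torch.finfo(t.dtype).tiny
        denormal = ((t != 0) & (t.abs() < tiny)).sum().item()
        print(name)
        print("        shape:", t.shape)
        print("        dtype:", t.dtype)
        print("        any nans?:", torch.isnan(t).sum().item())
        print("        min:", t.min().item())
        print("        max:", t.max().item())
        print("        mean:", t.mean().item())
        print("        std:", t.std().item())
        print("        denormal count:", denormal)

    # e.g. run it over every parameter of a loaded model:
    # for name, p in model.named_parameters():
    #     weight_report(name, p.data)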

 

There is an explanation here:

https://stackoverflow.com/questions/36781881/why-denormalized-floats-are-so-much-slower-than-other-floats-from-hardware-arch

Question:

Denormals are known to underperform severely, 100x or so, compared to normals. This frequently causes unexpected software problems.

I'm curious, from a CPU architecture viewpoint, why do denormals have to be that much slower? Is the lack of performance intrinsic to their unfortunate representation? Or do CPU architects neglect them to reduce hardware cost under the (mistaken) assumption that denormals don't matter?

In the former case, if denormals are intrinsically hardware-unfriendly, are there known non-IEEE-754 floating point representations that are also gapless near zero, but more convenient for hardware implementation?

Answer:

On most x86 systems, the cause of slowness is that denormal values trigger an FP_ASSIST which is very costly as it switches to a micro-code flow (very much like a fault).

see for example - https://software.intel.com/en-us/forums/intel-performance-bottleneck-analyzer/topic/487262

The reason why this is the case, is probably that the architects decided to optimize the HW for normal values by speculating that each value is normalized (which would be more common), and did not want to risk the performance of the frequent use case for the sake of rare corner cases. This speculation is usually true, so you only pay the penalty when you're wrong. These trade-offs are very common in CPU design since any investment in one case usually adds an overhead on the entire system.

In this case, if you were to design a system that tries to optimize all types of irregular FP values, you would have to either add HW to detect and record the state of each value after each operation (which would be multiplied by the number of physical FP registers, execution units, RS entries and so on, totaling a significant number of transistors and wires), or add some mechanism to check the value on read, which would slow you down when reading any FP value (even the normal ones).

Furthermore, based on the type, you would need to perform some correction or not - on x86 this is the purpose of the assist code, but if you did not make a speculation, you would have to perform this flow conditionally on each value, which would already add a large chunk of that overhead on the common path.
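To see the effect in isolation, and one possible workaround, here is a small self-contained sketch (PyTorch 1.x on an x86 CPU): the same convolution is timed with ordinary random weights, then with the weights scaled down into the subnormal range, and once more after asking PyTorch to flush denormals to zero. torch.set_flush_denormal is an existing PyTorch call, but whether it helps, and by how much, depends on the CPU and backend, so treat the numbers as illustrative only.

    import time
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    conv = nn.Conv2d(1024, 512, kernel_size=3, padding=1, bias=False)
    x = torch.randn(1, 1024, 32, 32)           # same shapes as in the dump above

    def bench(n=3):
        with torch.no_grad():
            conv(x)                             # warm-up
            t0 = time.time()
            for _ in range(n):
                conv(x)
        return (time.time() - t0) / n

    print("normal weights       :", bench(), "s/iter")

    with torch.no_grad():
        conv.weight.mul_(1e-40)                 # push every weight below the float32 normal range

    print("denormal weights     :", bench(), "s/iter")

    # Flush-to-zero / denormals-are-zero; returns False if the CPU lacks support.
    torch.set_flush_denormal(True)
    print("after flush_denormal :", bench(), "s/iter")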

 
