PyTorch: trained weights give slower CPU inference than randomly initialized weights

Environment 1: Python 2.7, torch 0.2

Environment 2: Python 3.6, torch 1.0

In both environments, inference on the CPU is slower with trained weights than with randomly initialized weights.

Inference time of my model with randomly initialized weights: about 2.9 s.

Inference time of the same model with trained weights: a bit over 6 s, accompanied by low CPU utilization.
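For reference, the comparison was measured roughly as sketched below. This is only a sketch written against PyTorch 1.x; MyNet and trained.pth are placeholders for my own network and checkpoint, and the input shape matches the dump further down.

    import time
    import torch

    torch.set_num_threads(4)              # adjust to your CPU

    def bench(model, x, n=10):
        model.eval()
        with torch.no_grad():             # PyTorch 1.x; torch 0.2 would use volatile Variables instead
            model(x)                      # warm-up
            t0 = time.time()
            for _ in range(n):
                model(x)
        return (time.time() - t0) / n

    x = torch.randn(1, 1024, 32, 32)      # same input shape as in the dump below

    model_random = MyNet()                # MyNet / trained.pth are placeholders for my own model
    model_trained = MyNet()
    model_trained.load_state_dict(torch.load("trained.pth", map_location="cpu"))

    print("random weights :", bench(model_random, x), "s/iter")
    print("trained weights:", bench(model_trained, x), "s/iter")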

I searched for a long time without finding the cause. Today I came across an explanation, so I'm noting it here.

Trained weight

        shape: torch.Size([512, 1024, 3, 3])

        dtype: torch.float32

        any nans?: 0

        min: -0.07247687131166458

        max: 0.0953093096613884

        mean: -4.6253231289483665e-07

        std: 0.0008879730594344437

Good (random) weight

        shape: torch.Size([512, 1024, 3, 3])

        dtype: torch.float32

        any nans?: 0

        min: -0.0048497761599719524

        max: 0.004831577185541391

        mean: -1.2575401342473924e-07

        std: 0.001000691088847816

Input vector

        shape: torch.Size([1, 1024, 32, 32])

        dtype: torch.float32

        any nans?: 0

        min: -5.017491817474365

        max: 4.822595119476318

        mean: -0.0013921058271080256

        std: 1.0015257596969604

The trained weights are denormal!
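A quick way to confirm this is to print the same statistics plus a count of subnormal entries. The helper below is my own sketch; it relies on torch.finfo, so it assumes a reasonably recent PyTorch.

    import torch

    def weight_report(name, t):
        # The smallest positive *normal* float32 is ~1.1755e-38; any non-zero
        # value with a smaller magnitude is a denormal (subnormal) number.
        tiny = torch.finfo(t.dtype).tiny
        denormal = ((t != 0) & (t.abs() < tiny)).sum().item()
        print(name)
        print("        shape:", t.shape)
        print("        dtype:", t.dtype)
        print("        any nans?:", torch.isnan(t).sum().item())
        print("        min:", t.min().item())
        print("        max:", t.max().item())
        print("        mean:", t.mean().item())
        print("        std:", t.std().item())
        print("        denormal count:", denormal)

    # e.g. run it over every parameter of a loaded model:
    # for name, p in model.named_parameters():
    #     weight_report(name, p.data)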

 

There is an explanation here:

https://stackoverflow.com/questions/36781881/why-denormalized-floats-are-so-much-slower-than-other-floats-from-hardware-arch

Question:

Denormals are known to underperform severely, 100x or so, compared to normals. This frequently causes unexpected software problems.

I'm curious, from a CPU architecture viewpoint, why do denormals have to be that much slower? Is the lack of performance intrinsic to their unfortunate representation? Or do CPU architects neglect them to reduce hardware cost under the (mistaken) assumption that denormals don't matter?

In the former case, if denormals are intrinsically hardware-unfriendly, are there known non-IEEE-754 floating point representations that are also gapless near zero, but more convenient for hardware implementation?

Answer:

On most x86 systems, the cause of slowness is that denormal values trigger an FP_ASSIST which is very costly as it switches to a micro-code flow (very much like a fault).

see for example - https://software.intel.com/en-us/forums/intel-performance-bottleneck-analyzer/topic/487262

The reason why this is the case, is probably that the architects decided to optimize the HW for normal values by speculating that each value is normalized (which would be more common), and did not want to risk the performance of the frequent use case for the sake of rare corner cases. This speculation is usually true, so you only pay the penalty when you're wrong. These trade-offs are very common in CPU design since any investment in one case usually adds an overhead on the entire system.

In this case, if you were to design a system that tries to optimize all types of irregular FP values, you would have to either add HW to detect and record the state of each value after each operation (which would be multiplied by the number of physical FP registers, execution units, RS entries and so on, totaling a significant number of transistors and wires), or add some mechanism to check the value on read, which would slow you down when reading any FP value (even the normal ones).

Furthermore, based on the type, you would need to perform some correction or not - on x86 this is the purpose of the assist code, but if you did not make a speculation, you would have to perform this flow conditionally on each value, which would already add a large chunk of that overhead on the common path.
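To see the effect in isolation, and one possible workaround, here is a small self-contained sketch (PyTorch 1.x on an x86 CPU): the same convolution is timed with ordinary random weights, then with the weights scaled down into the subnormal range, and once more after asking PyTorch to flush denormals to zero. torch.set_flush_denormal is an existing PyTorch call, but whether it helps, and by how much, depends on the CPU and backend, so treat the numbers as illustrative only.

    import time
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    conv = nn.Conv2d(1024, 512, kernel_size=3, padding=1, bias=False)
    x = torch.randn(1, 1024, 32, 32)           # same shapes as in the dump above

    def bench(n=3):
        with torch.no_grad():
            conv(x)                             # warm-up
            t0 = time.time()
            for _ in range(n):
                conv(x)
        return (time.time() - t0) / n

    print("normal weights       :", bench(), "s/iter")

    with torch.no_grad():
        conv.weight.mul_(1e-40)                 # push every weight below the float32 normal range

    print("denormal weights     :", bench(), "s/iter")

    # Flush-to-zero / denormals-are-zero; returns False if the CPU lacks support.
    torch.set_flush_denormal(True)
    print("after flush_denormal :", bench(), "s/iter")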

 
