看书的时候,看到pytorch居然自带了瓶颈检查的工具:torch.utils.bottleneck
用法:
python -m torch.utils.bottleneck 待测脚本路径
示例效果:
--------------------------------------------------------------------------------
Environment Summary
--------------------------------------------------------------------------------
PyTorch 1.7.1+cu101 DEBUG compiled w/ CUDA 10.1
Running with Python 3.8 and CUDA 10.1.243
`pip3 list` truncated output:
numpy==1.19.2
torch==1.7.1+cu101
torch-cluster==1.5.9
torch-geometric==1.6.3
torch-scatter==2.0.6
torch-sparse==0.6.9
torch-spline-conv==1.2.1
torchaudio==0.7.2
torchsummary==1.5.1
torchvision==0.8.2+cu101
--------------------------------------------------------------------------------
cProfile output
--------------------------------------------------------------------------------
705723930 function calls (591115631 primitive calls) in 252.079 seconds
Ordered by: internal time
List reduced from 960 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
114583910/48 99.432 0.000 197.735 4.119 /usr/lib/python3.8/copy.py:128(deepcopy)
9648/48 40.553 0.004 197.735 4.119 /usr/lib/python3.8/copy.py:200(_deepcopy_list)
229172740 25.338 0.000 25.338 0.000 {method 'get' of 'dict' objects}
24 16.446 0.685 16.446 0.685 {method 'run_backward' of 'torch._C._EngineBase' objects}
114605951 11.720 0.000 11.720 0.000 {built-in method builtins.id}
114574262 11.361 0.000 11.361 0.000 /usr/lib/python3.8/copy.py:182(_deepcopy_atomic)
117258341 9.575 0.000 9.575 0.000 {method 'append' of 'list' objects}
576 5.353 0.009 5.353 0.009 {built-in method equal}
1658851 4.531 0.000 4.531 0.000 {method 'uniform' of 'numpy.random.mtrand.RandomState' objects}
100 4.132 0.041 4.132 0.041 {built-in method __new__ of type object at 0x908780}
1005442 3.727 0.000 3.727 0.000 {method 'randint' of 'numpy.random.mtrand.RandomState' objects}
2080353 2.508 0.000 2.508 0.000 {built-in method builtins.min}
1918 1.864 0.001 2.046 0.001 /home3/jmh/scnet/TransformerFlow/data_builder.py:292(augment_expoential)
1 1.658 1.658 2.691 2.691 {built-in method _pickle.load}
9600 1.564 0.000 1.940 0.000 /home3/jmh/scnet/TransformerFlow/data_builder.py:132(refine_packet_length)
--------------------------------------------------------------------------------
autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
---------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
---------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
EmbeddingBackward 6.68% 664.353ms 6.68% 664.353ms 664.353ms 0.000us NaN 0.000us 0.000us 1
aten::embedding_backward 6.68% 664.345ms 6.68% 664.345ms 664.345ms 0.000us NaN 0.000us 0.000us 1
aten::embedding_dense_backward 6.68% 664.340ms 6.68% 664.340ms 664.340ms 0.000us NaN 0.000us 0.000us 1
EmbeddingBackward 6.67% 663.371ms 6.67% 663.371ms 663.371ms 0.000us NaN 0.000us 0.000us 1
aten::embedding_backward 6.67% 663.367ms 6.67% 663.367ms 663.367ms 0.000us NaN 0.000us 0.000us 1
aten::embedding_dense_backward 6.67% 663.363ms 6.67% 663.363ms 663.363ms 0.000us NaN 0.000us 0.000us 1
EmbeddingBackward 6.66% 663.205ms 6.66% 663.205ms 663.205ms 0.000us NaN 0.000us 0.000us 1
aten::embedding_backward 6.66% 663.200ms 6.66% 663.200ms 663.200ms 0.000us NaN 0.000us 0.000us 1
EmbeddingBackward 6.66% 663.198ms 6.66% 663.198ms 663.198ms 0.000us NaN 0.000us 0.000us 1
aten::embedding_dense_backward 6.66% 663.196ms 6.66% 663.196ms 663.196ms 0.000us NaN 0.000us 0.000us 1
aten::embedding_backward 6.66% 663.192ms 6.66% 663.192ms 663.192ms 0.000us NaN 0.000us 0.000us 1
aten::embedding_dense_backward 6.66% 663.189ms 6.66% 663.189ms 663.189ms 0.000us NaN 0.000us 0.000us 1
EmbeddingBackward 6.66% 663.103ms 6.66% 663.103ms 663.103ms 0.000us NaN 0.000us 0.000us 1
aten::embedding_backward 6.66% 663.098ms 6.66% 663.098ms 663.098ms 0.000us NaN 0.000us 0.000us 1
aten::embedding_dense_backward 6.66% 663.095ms 6.66% 663.095ms 663.095ms 0.000us NaN 0.000us 0.000us 1
---------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 9.952s
CUDA time total: 0.000us
--------------------------------------------------------------------------------
autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
Because the autograd profiler uses the CUDA event API,
the CUDA time column reports approximately max(cuda_time, cpu_time).
Please ignore this output if your code does not use CUDA.
---------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
---------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
EmbeddingBackward 6.71% 161.735ms 6.71% 161.735ms 161.735ms 511.000us 6.63% 511.000us 511.000us 1
aten::embedding_backward 6.71% 161.720ms 6.71% 161.720ms 161.720ms 508.500us 6.60% 508.500us 508.500us 1
aten::embedding_dense_backward 6.71% 161.708ms 6.71% 161.708ms 161.708ms 503.500us 6.53% 503.500us 503.500us 1
EmbeddingBackward 6.67% 160.718ms 6.67% 160.718ms 160.718ms 550.000us 7.14% 550.000us 550.000us 1
EmbeddingBackward 6.67% 160.708ms 6.67% 160.708ms 160.708ms 496.000us 6.44% 496.000us 496.000us 1
EmbeddingBackward 6.67% 160.677ms 6.67% 160.677ms 160.677ms 496.000us 6.44% 496.000us 496.000us 1
EmbeddingBackward 6.66% 160.661ms 6.66% 160.661ms 160.661ms 528.000us 6.85% 528.000us 528.000us 1
EmbeddingBackward 6.66% 160.504ms 6.66% 160.504ms 160.504ms 494.000us 6.41% 494.000us 494.000us 1
EmbeddingBackward 6.66% 160.453ms 6.66% 160.453ms 160.453ms 552.000us 7.16% 552.000us 552.000us 1
EmbeddingBackward 6.66% 160.446ms 6.66% 160.446ms 160.446ms 496.000us 6.44% 496.000us 496.000us 1
EmbeddingBackward 6.65% 160.314ms 6.65% 160.314ms 160.314ms 488.000us 6.33% 488.000us 488.000us 1
EmbeddingBackward 6.65% 160.290ms 6.65% 160.290ms 160.290ms 512.000us 6.65% 512.000us 512.000us 1
EmbeddingBackward 6.65% 160.238ms 6.65% 160.238ms 160.238ms 576.000us 7.48% 576.000us 576.000us 1
EmbeddingBackward 6.65% 160.237ms 6.65% 160.237ms 160.237ms 480.000us 6.23% 480.000us 480.000us 1
EmbeddingBackward 6.64% 160.183ms 6.64% 160.183ms 160.183ms 514.000us 6.67% 514.000us 514.000us 1
---------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 2.411s
CUDA time total: 7.705ms
它会分析cpu的执行耗时,以及GPU的执行耗时。
从这份瓶颈报告中,我可以很轻易的发现,deepcopy是目前我的程序的一个瓶颈。
于是对深层次拷贝做优化。