pytorch性能瓶颈检查

最新推荐文章于 2023-08-20 22:24:33 发布

Icoding_F2014

最新推荐文章于 2023-08-20 22:24:33 发布

阅读量1.6k

点赞数 2

分类专栏： pytorch 文章标签： leetcode windows github

本文链接：https://blog.csdn.net/jmh1996/article/details/121455625

版权

pytorch 专栏收录该内容

11 篇文章 1 订阅

订阅专栏

看书的时候，看到pytorch居然自带了瓶颈检查的工具：torch.utils.bottleneck
用法：

python -m torch.utils.bottleneck  待测脚本路径

示例效果：

--------------------------------------------------------------------------------
  Environment Summary
--------------------------------------------------------------------------------
PyTorch 1.7.1+cu101 DEBUG compiled w/ CUDA 10.1
Running with Python 3.8 and CUDA 10.1.243

`pip3 list` truncated output:
numpy==1.19.2
torch==1.7.1+cu101
torch-cluster==1.5.9
torch-geometric==1.6.3
torch-scatter==2.0.6
torch-sparse==0.6.9
torch-spline-conv==1.2.1
torchaudio==0.7.2
torchsummary==1.5.1
torchvision==0.8.2+cu101
--------------------------------------------------------------------------------
  cProfile output
--------------------------------------------------------------------------------
         705723930 function calls (591115631 primitive calls) in 252.079 seconds

   Ordered by: internal time
   List reduced from 960 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
114583910/48   99.432    0.000  197.735    4.119 /usr/lib/python3.8/copy.py:128(deepcopy)
  9648/48   40.553    0.004  197.735    4.119 /usr/lib/python3.8/copy.py:200(_deepcopy_list)
229172740   25.338    0.000   25.338    0.000 {method 'get' of 'dict' objects}
       24   16.446    0.685   16.446    0.685 {method 'run_backward' of 'torch._C._EngineBase' objects}
114605951   11.720    0.000   11.720    0.000 {built-in method builtins.id}
114574262   11.361    0.000   11.361    0.000 /usr/lib/python3.8/copy.py:182(_deepcopy_atomic)
117258341    9.575    0.000    9.575    0.000 {method 'append' of 'list' objects}
      576    5.353    0.009    5.353    0.009 {built-in method equal}
  1658851    4.531    0.000    4.531    0.000 {method 'uniform' of 'numpy.random.mtrand.RandomState' objects}
      100    4.132    0.041    4.132    0.041 {built-in method __new__ of type object at 0x908780}
  1005442    3.727    0.000    3.727    0.000 {method 'randint' of 'numpy.random.mtrand.RandomState' objects}
  2080353    2.508    0.000    2.508    0.000 {built-in method builtins.min}
     1918    1.864    0.001    2.046    0.001 /home3/jmh/scnet/TransformerFlow/data_builder.py:292(augment_expoential)
        1    1.658    1.658    2.691    2.691 {built-in method _pickle.load}
     9600    1.564    0.000    1.940    0.000 /home3/jmh/scnet/TransformerFlow/data_builder.py:132(refine_packet_length)


--------------------------------------------------------------------------------
  autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
        top 15 events sorted by cpu_time_total

----------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                              Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
----------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                 EmbeddingBackward         6.68%     664.353ms         6.68%     664.353ms     664.353ms       0.000us           NaN       0.000us       0.000us             1  
          aten::embedding_backward         6.68%     664.345ms         6.68%     664.345ms     664.345ms       0.000us           NaN       0.000us       0.000us             1  
    aten::embedding_dense_backward         6.68%     664.340ms         6.68%     664.340ms     664.340ms       0.000us           NaN       0.000us       0.000us             1  
                 EmbeddingBackward         6.67%     663.371ms         6.67%     663.371ms     663.371ms       0.000us           NaN       0.000us       0.000us             1  
          aten::embedding_backward         6.67%     663.367ms         6.67%     663.367ms     663.367ms       0.000us           NaN       0.000us       0.000us             1  
    aten::embedding_dense_backward         6.67%     663.363ms         6.67%     663.363ms     663.363ms       0.000us           NaN       0.000us       0.000us             1  
                 EmbeddingBackward         6.66%     663.205ms         6.66%     663.205ms     663.205ms       0.000us           NaN       0.000us       0.000us             1  
          aten::embedding_backward         6.66%     663.200ms         6.66%     663.200ms     663.200ms       0.000us           NaN       0.000us       0.000us             1  
                 EmbeddingBackward         6.66%     663.198ms         6.66%     663.198ms     663.198ms       0.000us           NaN       0.000us       0.000us             1  
    aten::embedding_dense_backward         6.66%     663.196ms         6.66%     663.196ms     663.196ms       0.000us           NaN       0.000us       0.000us             1  
          aten::embedding_backward         6.66%     663.192ms         6.66%     663.192ms     663.192ms       0.000us           NaN       0.000us       0.000us             1  
    aten::embedding_dense_backward         6.66%     663.189ms         6.66%     663.189ms     663.189ms       0.000us           NaN       0.000us       0.000us             1  
                 EmbeddingBackward         6.66%     663.103ms         6.66%     663.103ms     663.103ms       0.000us           NaN       0.000us       0.000us             1  
          aten::embedding_backward         6.66%     663.098ms         6.66%     663.098ms     663.098ms       0.000us           NaN       0.000us       0.000us             1  
    aten::embedding_dense_backward         6.66%     663.095ms         6.66%     663.095ms     663.095ms       0.000us           NaN       0.000us       0.000us             1  
----------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 9.952s
CUDA time total: 0.000us

--------------------------------------------------------------------------------
  autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
        top 15 events sorted by cpu_time_total

	Because the autograd profiler uses the CUDA event API,
	the CUDA time column reports approximately max(cuda_time, cpu_time).
	Please ignore this output if your code does not use CUDA.

----------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                              Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
----------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                 EmbeddingBackward         6.71%     161.735ms         6.71%     161.735ms     161.735ms     511.000us         6.63%     511.000us     511.000us             1  
          aten::embedding_backward         6.71%     161.720ms         6.71%     161.720ms     161.720ms     508.500us         6.60%     508.500us     508.500us             1  
    aten::embedding_dense_backward         6.71%     161.708ms         6.71%     161.708ms     161.708ms     503.500us         6.53%     503.500us     503.500us             1  
                 EmbeddingBackward         6.67%     160.718ms         6.67%     160.718ms     160.718ms     550.000us         7.14%     550.000us     550.000us             1  
                 EmbeddingBackward         6.67%     160.708ms         6.67%     160.708ms     160.708ms     496.000us         6.44%     496.000us     496.000us             1  
                 EmbeddingBackward         6.67%     160.677ms         6.67%     160.677ms     160.677ms     496.000us         6.44%     496.000us     496.000us             1  
                 EmbeddingBackward         6.66%     160.661ms         6.66%     160.661ms     160.661ms     528.000us         6.85%     528.000us     528.000us             1  
                 EmbeddingBackward         6.66%     160.504ms         6.66%     160.504ms     160.504ms     494.000us         6.41%     494.000us     494.000us             1  
                 EmbeddingBackward         6.66%     160.453ms         6.66%     160.453ms     160.453ms     552.000us         7.16%     552.000us     552.000us             1  
                 EmbeddingBackward         6.66%     160.446ms         6.66%     160.446ms     160.446ms     496.000us         6.44%     496.000us     496.000us             1  
                 EmbeddingBackward         6.65%     160.314ms         6.65%     160.314ms     160.314ms     488.000us         6.33%     488.000us     488.000us             1  
                 EmbeddingBackward         6.65%     160.290ms         6.65%     160.290ms     160.290ms     512.000us         6.65%     512.000us     512.000us             1  
                 EmbeddingBackward         6.65%     160.238ms         6.65%     160.238ms     160.238ms     576.000us         7.48%     576.000us     576.000us             1  
                 EmbeddingBackward         6.65%     160.237ms         6.65%     160.237ms     160.237ms     480.000us         6.23%     480.000us     480.000us             1  
                 EmbeddingBackward         6.64%     160.183ms         6.64%     160.183ms     160.183ms     514.000us         6.67%     514.000us     514.000us             1  
----------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.411s
CUDA time total: 7.705ms

它会分析cpu的执行耗时，以及GPU的执行耗时。
从这份瓶颈报告中，我可以很轻易的发现，deepcopy是目前我的程序的一个瓶颈。
于是对深层次拷贝做优化。

Icoding_F2014

关注

2
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
pytorch性能瓶颈检查

看书的时候，看到pytorch居然自带了瓶颈检查的工具：torch.utils.bottleneck用法：python -m torch.utils.bottleneck 待测脚本路径示例效果：-------------------------------------------------------------------------------- Environment Summary--------------------------------------------------
复制链接

扫一扫