A Knowledge Framework for Program Performance Optimization

Author: 爱上一只柠檬的pig_head

Last modified: 2022-09-18 10:23:45

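Categories: CPU optimization, artificial intelligence. Tags: deep learning, pytorch

First published: 2020-02-21 14:04:58

Copyright notice: this is an original article by the author, released under the CC 4.0 BY-SA license. Please include a link to the original source and this notice when reposting.

Article link: https://blog.csdn.net/zlgahu/article/details/104426881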

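1. Computer fundamentals

1.1 Memory & cache

1.1.1 Cache

1) What is “cache-friendly” code?

2) A detailed explanation of the CPU cache and cache lines

3) How can you tell how a piece of code uses the cache? (open question)

4) Leading dimension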

paper: http://changwanhong.com/publication/PLDI16.pdf
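
The leading dimension is the distance in memory between the starts of two consecutive rows of the parent array, which lets a routine address a sub-matrix in place. The following is a minimal pure-Python sketch of that idea, assuming row-major storage (in column-major BLAS the roles of rows and columns swap). The function names are hypothetical and chosen only for illustration.

```python
def element_offset(i, j, ld):
    """Flat offset of element (i, j) in a row-major buffer whose rows are ld elements apart."""
    return i * ld + j


def read_submatrix(buf, row0, col0, rows, cols, ld):
    """Read a rows x cols block that starts at (row0, col0) of the parent matrix."""
    base = element_offset(row0, col0, ld)
    return [[buf[base + element_offset(i, j, ld)] for j in range(cols)]
            for i in range(rows)]


# A 3 x 4 parent matrix stored row by row, so its leading dimension is 4.
parent = list(range(12))
block = read_submatrix(parent, row0=1, col0=1, rows=2, cols=2, ld=4)
print(block)
```

The block is 2 columns wide, yet it is still addressed with ld=4, because the leading dimension belongs to the parent buffer rather than to the block.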

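1.1.2 Memory optimization

1) TLB shootdown

Using jemalloc to mitigate TLB shootdowns

Jemalloc tuning

time strace -qq -fefutex -c your_bin (counts futex system calls to detect lock contention)

mmap/munmap/madvise

4) Memory optimization

Some questions:

i) Memory fragmentation

ii) Efficiency under multithreading

iii) Memory consumption of each allocation strategy; for example, can delayed release cause excessive memory usage?

iv) How much memory does the allocator's own metadata consume?

TCMalloc

a) How to control memory fragmentation

Allocate n pages at a time, with n chosen so that after the span is split evenly into objects of one size class, the wasted fraction of memory is less than 12.5%.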
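
The page-count rule above can be sketched as a small search. This is an illustration of the stated 12.5% bound, not TCMalloc's actual code. The function name, the 8 KiB page size, and the strict "less than" comparison are assumptions.

```python
def pages_for_size_class(object_size, page_size=8192, max_waste=0.125):
    """Smallest page count whose span wastes less than max_waste when cut into objects."""
    n = 1
    while True:
        span = n * page_size
        leftover = span % object_size  # tail bytes that cannot hold a whole object
        if span >= object_size and leftover < max_waste * span:
            return n
        n += 1


for size in (1024, 3072, 6144):
    print(size, pages_for_size_class(size))
```

For a 3072-byte class, one 8 KiB page would waste 2048 bytes (25%), while two pages waste 1024 of 16384 bytes (6.25%), so the search settles on two pages.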

2) TCMalloc's three-level caching scheme

ThreadCache, CentralCache, PageHeap

2. Performance optimization

2.1 CPU performance optimization

1) Thread affinity

Thread Affinity Interface (Linux* and Windows*)

Python CPU affinity
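
As a minimal sketch of CPU affinity from Python, the standard library exposes os.sched_getaffinity and os.sched_setaffinity on Linux. The helper names below are hypothetical, and the example assumes a Linux host.

```python
import os


def pin_to_first_cpu():
    """Pin the caller to one CPU it is already allowed on; return (old, new) CPU sets."""
    allowed = os.sched_getaffinity(0)   # pid 0 refers to the caller
    target = {min(allowed)}
    os.sched_setaffinity(0, target)
    return allowed, os.sched_getaffinity(0)


def restore_affinity(mask):
    """Put back a CPU set previously returned by os.sched_getaffinity."""
    os.sched_setaffinity(0, mask)


old_mask, new_mask = pin_to_first_cpu()
print("before:", sorted(old_mask), "after:", sorted(new_mask))
restore_affinity(old_mask)
```

Picking the target from the set that is already allowed avoids errors inside containers or cgroups that expose only a subset of the machine's cores.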

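2) Distributed training with MPI

Intel® MPI Library Developer Reference for Linux* OS (Beta)

Introduction to Groups and Communicators

Efficiency computation

Intel MPI developer guide

Intel® MPI Library 2019 Over Libfabric*

3) CPU frequency scaling

cpupower frequency-info: show the CPU frequency settings

cpupower frequency-set -g performance: switch the CPU frequency governor to performance; this mode gives higher performance but consumes more power

In powersave mode the CPU runs at a low frequency, which costs performance but saves power.

The s-tui Python package can display frequency information.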
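
The active governor can also be read programmatically. The sketch below assumes a Linux system that exposes the cpufreq sysfs tree under /sys/devices/system/cpu. The function name is hypothetical, and it returns None where that tree is missing (for example in many virtual machines).

```python
from pathlib import Path


def read_governor(cpu=0, root="/sys/devices/system/cpu"):
    """Return the cpufreq governor name for one CPU, or None if cpufreq is not exposed."""
    path = Path(root) / f"cpu{cpu}" / "cpufreq" / "scaling_governor"
    try:
        return path.read_text().strip()
    except OSError:
        return None


print("cpu0 governor:", read_governor(0))
```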

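4) GEMM optimization

a. A demo that walks through GEMM optimization step by step

https://github.com/flame/how-to-optimize-gemm

b. Anatomy of High-Performance Matrix Multiplication

Describes the different ways of blocking (tiling) the matrices.

https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf

c. Anatomy of High-Performance Many-Threaded Matrix Multiplication

Describes how to use multithreading to accelerate GEMM: which dimensions are suitable for parallelization, and how does multithreading affect the cache?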

https://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf
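
To make the blocking idea concrete, here is a small pure-Python sketch of loop tiling for C = A * B. It shows only the loop structure. It is not the algorithm from the papers above, which also pack panels and use a hand-tuned micro-kernel. The function names and the block size of 2 are assumptions for illustration.

```python
def matmul_naive(a, b):
    """Reference triple loop: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_blocked(a, b, block=2):
    """Same product, computed tile by tile so each tile of A, B and C is reused while hot."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, block):
        for p0 in range(0, k, block):
            for j0 in range(0, n, block):
                for i in range(i0, min(i0 + block, m)):
                    for p in range(p0, min(p0 + block, k)):
                        a_ip = a[i][p]  # reused across the whole inner j loop
                        for j in range(j0, min(j0 + block, n)):
                            c[i][j] += a_ip * b[p][j]
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_blocked(a, b))
```

In compiled code the block size would be chosen so that the working tiles fit in a given cache level. In Python the example only demonstrates that the tiled loop order produces the same result.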

5) SIMD optimization

Improving performance with SIMD intrinsics in three use cases

6) Inspecting hardware information

All CPU info: https://github.com/opcm/pcm

sudo dmidecode

e.g. sudo dmidecode --type 17 shows memory information

ifconfig + ethtool: check NIC bandwidth (ethtool/dmesg: https://www.cyberciti.biz/faq/howto-determine-ethernet-connection-speed/)

fi_info: list the supported FI_PROVIDER values, e.g. psm3, sockets

Check CPU frequency: turbostat --interval 1 -c core_id_beg-core_end

Check IB (InfiniBand) information: ibv_devinfo or rdma -d link

ibv_devices

Check runtime bandwidth: ethbw

Disk information: lsblk -o NAME,FSTYPE,LABEL,MOUNTPOINT,SIZE,MODEL

Viewing and mounting disks on Linux

2.2 Performance analysis tools

2.2.1 perf

2.2.2 valgrind

2.2.3 vtune

2.2.4 gperftools

Profiling with GPerfTools

conda install -c conda-forge gperftools ghostscript graphviz libunwind

sudo yum install gv

3. PyTorch

3.1 Tensor

1) contiguous in PyTorch
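
The idea behind contiguity can be modeled without PyTorch: a tensor is a flat buffer plus shape and stride metadata, and operations such as transpose only rewrite the metadata. The sketch below is a simplified pure-Python model with hypothetical function names. It is not PyTorch's implementation and ignores special cases such as size-1 dimensions.

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a densely packed tensor of this shape."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def permute(shape, strides, order):
    """Reorder dimensions by permuting metadata only; the underlying buffer is untouched."""
    return tuple(shape[d] for d in order), tuple(strides[d] for d in order)


def is_contiguous(shape, strides):
    """True when the strides match what a fresh row-major tensor of this shape would have."""
    return tuple(strides) == contiguous_strides(shape)


shape = (2, 3)
strides = contiguous_strides(shape)
t_shape, t_strides = permute(shape, strides, (1, 0))  # a transpose
print(shape, strides, is_contiguous(shape, strides))
print(t_shape, t_strides, is_contiguous(t_shape, t_strides))
```

After the transpose the strides no longer match the row-major layout of the new shape, which is the situation in which a copy into a fresh buffer is needed before the data can be treated as densely packed again.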

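2) PyTorch dispatch mechanism

Dispatch by data type

Dispatch by instruction set

3) Thread management in PyTorch

a. Forward and backward each use their own set of threads. When laying out threads across cores, make sure the forward thread and the backward thread of the same computation run on the same CPU core; this gives better performance.

4) PyTorch distributed training

WRITING DISTRIBUTED APPLICATIONS WITH PYTORCH

This document covers the basics of multi-node training, such as scatter, gather, and allreduce, and is a helpful introduction.

DISTRIBUTED COMMUNICATION PACKAGE - TORCH.DISTRIBUTED

This document mainly describes the multi-node communication interfaces that PyTorch provides; it is effectively the API reference for the previous document.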
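
The semantics of the collectives mentioned above can be illustrated in a single process, with a list index standing in for the rank. This is a toy model with hypothetical function names. It does not use torch.distributed and performs no real communication.

```python
def scatter(chunks):
    """The root holds one chunk per rank; rank r receives chunks[r]."""
    return [list(chunk) for chunk in chunks]


def gather(per_rank):
    """The root collects every rank's value, ordered by rank."""
    return list(per_rank)


def all_reduce_sum(per_rank):
    """Every rank ends up with the element-wise sum of all ranks' vectors."""
    total = [sum(values) for values in zip(*per_rank)]
    return [list(total) for _ in per_rank]


grads = scatter([[1, 2], [3, 4], [5, 6]])   # three simulated ranks
print(all_reduce_sum(grads))
```

In data-parallel training the allreduce step plays this role for gradients: each worker contributes its local gradient and every worker receives the same reduced result.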

A Gentle Introduction to Multi GPU and Multi Node Distributed Training

DDP DISTRIBUTED DATA PARALLEL

model parallel vs data parallel

Intel MPI process PIN

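5) Tutorials on the PyTorch framework

PyTorch – Internal Architecture Tour

6) Profiling

torch.autograd.profiler.profile is a context manager that reports the execution time of the C++ kernels behind each function. If the time of a particular C++ kernel is not reported, you can use the RECORD_FUNCTION macro inside that kernel to enable profiling for it. For example, to record timing information for the is_contiguous function: RECORD_FUNCTION("is_contiguous", std::vector<c10::IValue>());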
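
On the Python side, using the context manager might look like the sketch below. It assumes PyTorch is installed. The helper name profile_mm and the 64 x 64 matrix size are hypothetical, and the import is guarded only so the snippet can be loaded on machines without PyTorch.

```python
try:
    import torch
except ImportError:  # lets the sketch be loaded where PyTorch is not installed
    torch = None


def profile_mm(size=64):
    """Profile one matrix multiply and return the per-operator summary table as text."""
    a = torch.randn(size, size)
    b = torch.randn(size, size)
    with torch.autograd.profiler.profile() as prof:
        torch.mm(a, b)
    return prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10)


if torch is not None:
    print(profile_mm())
```

The table lists each recorded operator with its CPU time. Kernels instrumented with RECORD_FUNCTION would be expected to show up as additional rows under the name passed to the macro.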

7) PyTorch TensorIterator

PyTorch TensorIterator Internals | Quansight Labs

https://github.com/pytorch/pytorch/wiki/How-to-use-TensorIterator

8) PyTorch Dispatcher

Registering a Dispatched Operator in C++ — PyTorch Tutorials 1.10.0+cu102 documentation

9) view transpose permute and reshape

10) PyTorch Basics: Understanding Autograd and Computation Graphs

4. Deep learning

4.1 Basic theory

1) Backward and SGD

2) Understanding PyTorch's autograd mechanism by reading the source

3) PyTorch internals

4) PyTorch autograd explained in detail

PyTorch 101, Part 1: Understanding Graphs, Automatic Differentiation and Autograd

5) Cross Entropy, KL Divergence, and Maximum Likelihood Estimation

Lei Mao's Log Book – Cross Entropy, KL Divergence, and Maximum Likelihood Estimation
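
The relationship between the three quantities is H(p, q) = H(p) + KL(p || q), so minimizing cross entropy against a fixed p is the same as minimizing the KL divergence. A small numeric check, using two made-up distributions chosen only for illustration:

```python
import math


def entropy(p):
    """H(p) = -sum p_i log p_i"""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum p_i log q_i"""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL(p || q) = sum p_i log(p_i / q_i)"""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.7, 0.2, 0.1]   # reference distribution (illustrative values)
q = [0.5, 0.3, 0.2]   # model distribution (illustrative values)
gap = cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))
print(cross_entropy(p, q), entropy(p), kl_divergence(p, q), gap)
```

The printed gap is zero up to floating-point rounding, and the KL term vanishes when q equals p.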

6) Label Smoothing
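
A minimal sketch of label smoothing, assuming the common variant that mixes the one-hot target with a uniform distribution over all K classes, i.e. (1 - eps) * one_hot + eps / K. The function name and the default eps of 0.1 are assumptions.

```python
def smooth_labels(target, num_classes, eps=0.1):
    """Mix a one-hot target with the uniform distribution over all classes."""
    uniform = eps / num_classes
    return [(1.0 - eps) + uniform if k == target else uniform
            for k in range(num_classes)]


print(smooth_labels(target=1, num_classes=4, eps=0.1))
```

The smoothed vector still sums to 1, but the true class no longer has probability exactly 1, which discourages the model from producing arbitrarily large logits.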

7) How to count the computation of an operator

How are the compute (FLOPs) and parameter counts of a CNN model calculated? - Zhihu

https://github.com/Lyken17/pytorch-OpCounter/tree/master/thop
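
For a standard 2D convolution, the usual hand calculation can be sketched as below. This is not the code of the linked tool. The function name is hypothetical, and the sketch counts multiply-accumulates (MACs); a common convention, which individual tools may or may not follow, reports FLOPs as roughly 2 x MACs.

```python
def conv2d_cost(c_in, c_out, kernel, h_out, w_out, bias=True):
    """Return (parameters, multiply-accumulates) of one standard 2D convolution layer."""
    k_h, k_w = kernel
    params = c_out * (c_in * k_h * k_w + (1 if bias else 0))
    # Every output element needs one MAC per weight of its filter.
    macs = c_out * h_out * w_out * c_in * k_h * k_w
    return params, macs


# Example: 3 -> 64 channels, 7x7 kernel, 112x112 output map, no bias.
print(conv2d_cost(3, 64, (7, 7), 112, 112, bias=False))
```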

4.2 Automatic mixed precision

IEEE 754 single-precision floating-point conversion

Float addition & subtraction
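
The single-precision layout (1 sign bit, 8 exponent bits biased by 127, 23 fraction bits) can be inspected with the standard struct module. The helper names below are hypothetical, and the reconstruction formula covers normal numbers only, not zeros, subnormals, infinities or NaNs.

```python
import struct


def decode_float32(x):
    """Split x, rounded to single precision, into (sign, exponent, fraction) bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF     # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF         # 23 explicit mantissa bits
    return sign, exponent, fraction


def encode_float32(sign, exponent, fraction):
    """Value of a normal number from its fields: (-1)^s * (1 + f / 2^23) * 2^(e - 127)."""
    return (-1.0) ** sign * (1 + fraction / 2 ** 23) * 2.0 ** (exponent - 127)


fields = decode_float32(-6.5)   # 6.5 = 1.625 * 2^2
print(fields, encode_float32(*fields))
```

Half-precision formats used in mixed-precision training follow the same scheme with fewer exponent and fraction bits, which is why their range and precision are much smaller.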

4.3 seq2seq

1) CS224N (1.31) Translation, Seq2Seq, Attention

5. C++

1) Virtual tables (vtables)

Understanding Virtual Tables in C++ | Pablo Arias

2) CRTP

Curiously recurring template pattern (CRTP)

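6. Engineering practice

6.1 Compilation

1) How do you build a smaller executable? Which compiler options are needed?

Removing Unused Code

The LTO build option can remove unused functions, but it increases build time. The two options -ffunction-sections and -Wl,--gc-sections can also remove them, but because every function is placed in its own section, they may introduce extra memory overhead.

Symbol visibility can also be used to control this:

linker - Limiting visibility of symbols when linking shared libraries - Stack Overflow

2) -ffast-math can substantially speed up floating-point code, at the cost of relaxing strict IEEE 754 semantics, so results may differ slightly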
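
One reason results can change is that -ffast-math lets the compiler reassociate floating-point operations, and floating-point addition is not associative. The pure-Python sketch below does not involve GCC. It only shows that summing the same four numbers in a different grouping gives a different answer; the values are chosen for illustration.

```python
def sum_left_to_right(values):
    """Accumulate strictly in the given order, as strict IEEE semantics would require."""
    total = 0.0
    for v in values:
        total += v
    return total


# Same four numbers, two groupings.
in_order = sum_left_to_right([1e16, 1.0, -1e16, 1.0])    # the first 1.0 is absorbed by 1e16
regrouped = sum_left_to_right([1e16, -1e16, 1.0, 1.0])   # the large terms cancel first
print(in_order, regrouped)
```

Code that depends on a specific summation order, such as compensated summation, is therefore worth re-validating after enabling the flag.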

3) GCC install

InstallingGCC - GCC Wiki

6.2 Debugging

1) core dump

The core dump file in Arch Linux | Nan Xiao's Blog

7. Books & courses

High_Performance_Computing_(Severance)

Compiler and linker options referenced in section 6.1: -ffunction-sections -Wl,--gc-sections
