Preface
A hardcore Bilibili course: UC Berkeley CS 194, Introduction to Parallel Programming
https://www.bilibili.com/video/BV1QQ4y1o7rn?p=3&spm_id_from=pageDriver&vd_source=e0b3cb923dbb83260e674c055a5ec68f
Lecture 3: Cache-Oblivious Matmul
blocked matrix multiply
Reuse data from the right-hand matrix to reduce the number of full-column reads:
this way B only needs to be re-read N times instead of j times.
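A minimal blocked-matmul sketch of the reuse idea (the block size BS = 32 and the loop order are illustrative choices, not taken from the lecture):

```c
#include <stddef.h>

/* Blocked (tiled) matrix multiply, C += A * B, all n x n, row-major.
 * With BS chosen so that three BS x BS tiles fit in cache, each tile
 * of B stays resident while a whole tile of C is updated, instead of
 * the naive loop streaming B's columns through cache once per j. */
enum { BS = 32 };

void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t kk = 0; kk < n; kk += BS)
            for (size_t jj = 0; jj < n; jj += BS)
                /* multiply one pair of tiles; B's tile is reused across i */
                for (size_t i = ii; i < ii + BS && i < n; i++)
                    for (size_t k = kk; k < kk + BS && k < n; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jj + BS && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```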
Recursive Matrix Multiplication
Unroll the for loops by recursing on the 2D matmul, reading memory directly into cache.
- The recursion must not go all the way down to 1×1; first tile the right matrix to fix a minimum compute unit.
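A minimal sketch of the recursive scheme, assuming power-of-two sizes; the CUTOFF base-case size is a hypothetical stand-in for the "minimum compute unit" above:

```c
#include <stddef.h>

/* Cache-oblivious recursive matmul: C += A * B on n x n row-major
 * matrices (n a power of two for simplicity), with ld the leading
 * dimension of the original arrays.  Recursing all the way to 1x1
 * would drown in call overhead, so the recursion bottoms out at a
 * small CUTOFF block handled with plain loops. */
enum { CUTOFF = 16 };

static void matmul_rec(size_t n, size_t ld,
                       const double *A, const double *B, double *C)
{
    if (n <= CUTOFF) {
        for (size_t i = 0; i < n; i++)
            for (size_t k = 0; k < n; k++)
                for (size_t j = 0; j < n; j++)
                    C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
        return;
    }
    size_t h = n / 2;
    const double *A11 = A, *A12 = A + h, *A21 = A + h * ld, *A22 = A21 + h;
    const double *B11 = B, *B12 = B + h, *B21 = B + h * ld, *B22 = B21 + h;
    double *C11 = C, *C12 = C + h, *C21 = C + h * ld, *C22 = C21 + h;
    /* each quadrant of C accumulates two sub-products */
    matmul_rec(h, ld, A11, B11, C11); matmul_rec(h, ld, A12, B21, C11);
    matmul_rec(h, ld, A11, B12, C12); matmul_rec(h, ld, A12, B22, C12);
    matmul_rec(h, ld, A21, B11, C21); matmul_rec(h, ld, A22, B21, C21);
    matmul_rec(h, ld, A21, B12, C22); matmul_rec(h, ld, A22, B22, C22);
}
```

Once a sub-matrix fits in some cache level, all further recursion on it hits that level, without the code ever knowing the cache size.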
Recommended reading:
https://cvw.cac.cornell.edu/vector/
Lecture 6: Shared Memory (OpenMP)
Lecture 7: Roofline Performance Model
Paper link:
https://people.eecs.berkeley.edu/~kubitron/cs252/handouts/papers/RooflineVyNoYellow.pdf
machine balance = 4
So first evaluate whether the current program sits to the left or to the right of the machine balance (the ridge point of the roofline).
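That check follows directly from the roofline formula attainable = min(peak, bandwidth × intensity); the function name and the daxpy-style numbers in the usage below are illustrative assumptions, not figures from the lecture:

```c
/* Roofline model: attainable performance is the lower of the compute
 * roof and the memory roof.  A kernel is memory-bound ("left of the
 * ridge") whenever its arithmetic intensity (flops per byte of DRAM
 * traffic) is below the machine balance = peak / bandwidth. */
double attainable_gflops(double peak_gflops, double bw_gbs, double intensity)
{
    double mem_roof = bw_gbs * intensity;   /* GB/s * flops/byte = Gflop/s */
    return mem_roof < peak_gflops ? mem_roof : peak_gflops;
}
```

For example, with a 16 Gflop/s peak over 4 GB/s (machine balance = 4 flops/byte), a daxpy-like kernel at 2 flops per 24 bytes lands far left of the ridge and is capped by the memory roof.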
Shifting the far-left (memory-bound) part of the roofline up:
- factors that reduce attainable bandwidth:
  - lack of prefetching
  - ignoring NUMA (non-uniform memory access: some memory is closer to some cores)
  - different bandwidth at each level of the memory hierarchy
Shifting the peak DP (double-precision) roof up:
- increase computational intensity
- bandwidth-reducing optimizations:
  - cache re-use (tiling)
  - compression techniques
Lecture 8: Synchronization
Avoid contention for shared memory. If multiple threads write to the same shared-memory location at the same time, each thread independently loads the cache line into its own L1 cache, and the exclusive write ownership of the line means the writes cannot proceed in parallel.
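A minimal OpenMP sketch of sidestepping that contention (assuming compilation with -fopenmp; without it the pragma is simply ignored and the loop runs serially):

```c
/* Instead of every thread updating one shared accumulator under a
 * lock -- which bounces the cache line between cores -- the
 * reduction(+:sum) clause gives each thread a private copy of sum
 * that OpenMP combines once at the end of the parallel region. */
long parallel_sum(const long *a, long n)
{
    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```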
Lecture 9: Load Balancing
- self scheduling: https://dl.acm.org/doi/pdf/10.1145/55364.55422
- Guided self-scheduling: A practical scheduling scheme for parallel supercomputers
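The guided self-scheduling chunk rule (each idle processor grabs ceil(remaining / p) of the loop iterations, so chunks start large and shrink toward 1) can be simulated serially; this sketch only reproduces the chunk-size sequence, not an actual parallel scheduler:

```c
#include <stddef.h>

/* Fills chunks[] with the guided self-scheduling chunk sizes for n
 * iterations on p processors and returns the number of chunks
 * (chunks[] is assumed large enough).  Early chunks are big to keep
 * per-chunk scheduling overhead low; late chunks are small so the
 * processors all finish at about the same time. */
size_t gss_chunks(size_t n, size_t p, size_t *chunks)
{
    size_t remaining = n, count = 0;
    while (remaining > 0) {
        size_t c = (remaining + p - 1) / p;   /* ceil(remaining / p) */
        chunks[count++] = c;
        remaining -= c;
    }
    return count;
}
```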
With one task queue per processor, a scheduler is still needed: when one processor runs ahead, it has to take some tasks from other queues to keep the load balanced:
- taking a task from a random processor is suboptimal when n == p; use this approach only when tasks are large enough;
- taking the task with the longest compute time is more generally applicable.
Lecture 10: Linearizability
https://dl.acm.org/doi/pdf/10.5555/2385452
concurrent queue
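As a sketch of where linearization points live in a concurrent queue, here is a minimal single-producer/single-consumer ring buffer in C11 atomics (an illustrative design, not the queue from the lecture): each operation takes effect atomically at its release store, so every concurrent history is equivalent to some sequential FIFO order.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QCAP 8  /* capacity is QCAP - 1; must be a power of two */

typedef struct {
    int buf[QCAP];
    atomic_size_t head;   /* next slot to dequeue */
    atomic_size_t tail;   /* next slot to enqueue */
} spsc_queue;

bool spsc_enqueue(spsc_queue *q, int v)
{
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (((t + 1) & (QCAP - 1)) == h)
        return false;                       /* full */
    q->buf[t] = v;
    /* linearization point: publishing the new tail */
    atomic_store_explicit(&q->tail, (t + 1) & (QCAP - 1),
                          memory_order_release);
    return true;
}

bool spsc_dequeue(spsc_queue *q, int *out)
{
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h == t)
        return false;                       /* empty */
    *out = q->buf[h];
    /* linearization point: publishing the new head */
    atomic_store_explicit(&q->head, (h + 1) & (QCAP - 1),
                          memory_order_release);
    return true;
}
```

The release/acquire pairing on `tail` makes an enqueued value visible to the consumer only after the value itself is written, which is what makes the operation appear instantaneous at its linearization point.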