Learning Roadmap:
Section 1: Intro to Parallel Programming & MUSA
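- Deep Learning Ecosystem (Moore Threads domestic GPU MUSA parallel programming study notes, 2024/11/30, CSDN blog)
- Ubuntu + Driver + Toolkit + conda + pytorch + torch_musa environment setup (2024/11/24: Ubuntu/Windows dual-boot installation | 2024/11/30: GPU driver & MUSA Toolkit installation)
- C/C++ Review (Moore Threads domestic GPU MUSA parallel programming study notes, 2024/11/22, CSDN blog)
- GPU intros (Moore Threads domestic GPU MUSA parallel programming study notes, 2024/11/25, CSDN blog)
- GPU hardware architecture (Moore Threads domestic GPU MUSA parallel programming study notes, 2024/11/26, CSDN blog)
- Write First Kernels (Here) (2024/11/27: Thread hierarchy | 2024/11/28: First MUSA Kernel to Count Thread | 2024/12/02: Vector addition | 2024/12/03: Vector addition (3D) | 2024/12/04: Matrix multiplication)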
- MUSA API
- Faster Matrix Multiplication
- Triton
- Pytorch Extensions (Moore Threads domestic GPU MUSA parallel programming study notes, 2024/11/21, CSDN blog)
- MNIST Multilayer Perceptron
Section 2: Parallel Programming & MUSA in Depth
- Analyzing Parallel Program Performance on a Quad-Core CPU
- Scheduling Task Graphs on a Multi-Core CPU
- A Simple Renderer in MUSA
- Optimizing DNN Performance on DNN Accelerator Hardware
- llm.c
Ref: 摩尔学院 (Moore Academy) | High-Performance Computing with GPUs | Stanford CS149 - Video | Stanford CS149 - Syllabus
Atomic Operation
Ref: High-Performance Computing with GPUs Chapter 5
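The MUSA program in this section illustrates atomic operations in GPU programming. When many threads write to the same location in device memory, a plain increment is a read-modify-write sequence that races: threads read stale values and overwrite each other's results (dirty reads and dirty writes). The kernel incrementCounterNonAtomic performs the increment with no thread safety, while incrementCounterAtomic performs it safely with atomicAdd. The results show clearly that the thread-safe atomic add yields the correct count.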
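To make the lost-update problem concrete, here is a minimal host-side sketch in plain C++. It is not MUSA code and it runs on the CPU in a single thread; it only models the order in which loads and stores could happen. The names ModelThread, run_interleaved, and run_serialized are hypothetical and exist only for this illustration.

```cpp
#include <vector>

// Host-side model of one GPU thread executing `*counter = *counter + 1`
// as two separate steps: a load into a private register, then a store.
// (Hypothetical helper for illustration only, not part of any MUSA API.)
struct ModelThread {
    int reg = 0;                                        // private copy of the counter
    void load(const int& counter) { reg = counter; }    // step 1: read
    void store(int& counter) const { counter = reg + 1; } // step 2: write back
};

// Worst case: every thread loads before any thread stores.
// All threads see the same old value, so all but one increment is lost.
int run_interleaved(int num_threads) {
    int counter = 0;
    std::vector<ModelThread> threads(num_threads);
    for (auto& t : threads) t.load(counter);
    for (auto& t : threads) t.store(counter);
    return counter;
}

// Each thread completes its load and store before the next one starts.
// This is the outcome an indivisible (atomic) read-modify-write guarantees.
int run_serialized(int num_threads) {
    int counter = 0;
    std::vector<ModelThread> threads(num_threads);
    for (auto& t : threads) {
        t.load(counter);
        t.store(counter);
    }
    return counter;
}
```

With four modeled threads, run_interleaved(4) returns 1 while run_serialized(4) returns 4. A real GPU run falls somewhere between these extremes, depending on how the hardware schedules the threads.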
Code location
Code
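#include <musa_runtime.h>
#include <stdio.h>

#define NUM_THREADS 1000  // threads per block
#define NUM_BLOCKS 1000   // blocks per grid

// Kernel without atomics (incorrect): the read and the write are separate
// steps, so other threads can update the counter in between.
__global__ void incrementCounterNonAtomic(int* counter) {
    // no lock is taken before the read
    int old = *counter;
    int new_value = old + 1;
    // no lock is released after the write; this may overwrite other threads' updates
    *counter = new_value;
}

// Kernel with atomics (correct): the read-modify-write is one indivisible step.
__global__ void incrementCounterAtomic(int* counter) {
    // atomicAdd returns the previous value, which is not needed here
    atomicAdd(counter, 1);
}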

int main() {
    int h_counterNonAtomic = 0;
    int h_counterAtomic = 0;
    int *d_counterNonAtomic, *d_counterAtomic;

    // Allocate device memory
    musaMalloc((void**)&d_counterNonAtomic, sizeof(int));
    musaMalloc((void**)&d_counterAtomic, sizeof(int));

    // Copy initial counter values to device
    musaMemcpy(d_counterNonAtomic, &h_counterNonAtomic, sizeof(int), musaMemcpyHostToDevice);
    musaMemcpy(d_counterAtomic, &h_counterAtomic, sizeof(int), musaMemcpyHostToDevice);

    // Launch kernels
    incrementCounterNonAtomic<<<NUM_BLOCKS, NUM_THREADS>>>(d_counterNonAtomic);
    incrementCounterAtomic<<<NUM_BLOCKS, NUM_THREADS>>>(d_counterAtomic);

    // Copy results back to host
    musaMemcpy(&h_counterNonAtomic, d_counterNonAtomic, sizeof(int), musaMemcpyDeviceToHost);
    musaMemcpy(&h_counterAtomic, d_counterAtomic, sizeof(int), musaMemcpyDeviceToHost);

    // Print results
    printf("Non-atomic counter value: %d\n", h_counterNonAtomic);
    printf("Atomic counter value: %d\n", h_counterAtomic);

    // Free device memory
    musaFree(d_counterNonAtomic);
    musaFree(d_counterAtomic);

    return 0;
}
Compile and run
mcc 00_atomicAdd.mu -o atomicAdd -mtgpu -O2 -lmusart
./atomicAdd
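Output
As the output shows, the atomic add keeps the result correct. The atomic counter should reach NUM_BLOCKS × NUM_THREADS = 1,000,000, while the non-atomic counter typically ends up far smaller because concurrent increments overwrite one another.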
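One way to see why the atomic version does not lose updates is to model an atomic add as a compare-and-swap retry loop. The sketch below is a conceptual host-side model in standard C++ using std::atomic. It is an assumption made for illustration, not a description of how MUSA hardware implements atomicAdd. The function cas_add and its interfere hook are hypothetical names; the hook imitates another thread updating the counter between the read and the write.

```cpp
#include <atomic>

// Conceptual model of an atomic add built from compare-and-swap.
// Returns the value the counter held just before the addition, the same
// return convention as atomicAdd. `interfere` is a hypothetical hook that
// runs between the read and the swap to imitate another thread's update;
// `retries` reports how many times the swap had to be repeated.
template <typename Interfere>
int cas_add(std::atomic<int>& counter, int delta, Interfere interfere, int& retries) {
    int old = counter.load();
    retries = 0;
    for (;;) {
        interfere(counter);  // another thread may update the counter here
        // The swap succeeds only if the counter still holds `old`.
        // On failure, `old` is refreshed with the current value and we retry,
        // so the concurrent update is never overwritten.
        if (counter.compare_exchange_strong(old, old + delta)) {
            return old;
        }
        ++retries;
    }
}
```

If the hook increments the counter once during the first attempt, the first swap fails, the loop retries against the fresh value, and both increments survive: starting from 0, the counter ends at 2, cas_add returns 1, and retries is 1. The non-atomic kernel has no such check, so its store would simply overwrite the other thread's update.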