High Performance Computing Overview

This document surveys the fundamentals of parallel computing, covering heterogeneous computing, the MPI message-passing interface, concurrency and multi-processing, CPU information, ARM CPU features, CPU benchmarking, and stress testing. It also collects tools and tutorials for OpenMP, OpenACC, Intel TBB, GPU programming, and software optimization.


Overview

Heterogeneous Computing

MPI (Message Passing Interface)

  • MPI Forum: the standardization forum for MPI
  • Open MPI: Open Source High Performance Computing.
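
As a quick orientation, the classic MPI hello-world below shows the lifecycle every MPI program follows (initialize, query rank and size, finalize). This is a minimal sketch using only the standard MPI C API; it is typically built with mpic++ and launched with mpirun.

// hello_mpi.cpp - minimal MPI lifecycle sketch
// Build:  mpic++ hello_mpi.cpp -o hello_mpi
// Run:    mpirun -np 4 ./hello_mpi
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);                  // start the MPI runtime

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // this process's id
    MPI_Comm_size(MPI_COMM_WORLD, &size);    // total number of processes

    std::printf("hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                          // shut down the MPI runtime
    return 0;
}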

Concurrency

  • task switching
  • hardware concurrency

Multi-Processing

IPC

Multi-Threading

  • POSIX C pthread
  • boost::thread
  • C++11 std::thread (see the minimal sketch after this list)
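
For reference, a minimal C++11 std::thread sketch, assuming nothing beyond the standard library: it launches a few worker threads and joins them.

// threads.cpp - minimal C++11 std::thread sketch
// Build: g++ -std=c++11 -pthread threads.cpp -o threads
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    // hardware_concurrency() hints at the number of hardware threads
    // (it may return 0 if unknown); cf. nproc in the CPU section below.
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;  // fallback when the hint is unavailable

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back([i] { std::printf("worker %u running\n", i); });

    for (auto& t : workers) t.join();  // wait for all workers to finish
    return 0;
}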

CPU

CPU Info

Eight commands to check CPU information on Linux:

  • /proc/cpuinfo: the /proc/cpuinfo file contains details about the individual CPU cores.

  • lscpu: prints the CPU hardware details in a user-friendly format.

  • cpuid: fetches CPUID information about Intel and AMD x86 processors.

  • nproc: prints the number of processing units available; note that this is not always the same as the number of cores (see the sketch after this list).

  • dmidecode: displays information about the CPU, including the socket type, vendor name, and various flags.

  • hardinfo: produces a large report about many hardware parts by reading files from the /proc directory.

  • lshw -class processor: lshw by default shows information about various hardware parts; the -class option picks out a specific hardware part.

  • inxi: a script that uses other programs to generate a well-structured, easy-to-read report about various hardware components on the system.
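
To illustrate the nproc caveat programmatically, here is a small Linux-only C++ sketch (assuming /proc is mounted) that counts logical processors in /proc/cpuinfo and compares the result with the POSIX sysconf value:

// cpucount.cpp - count logical processors on Linux (sketch)
// Build: g++ -std=c++11 cpucount.cpp -o cpucount
#include <cstdio>
#include <fstream>
#include <string>
#include <unistd.h>   // sysconf

int main() {
    // Each logical processor has its own "processor : N" stanza.
    std::ifstream cpuinfo("/proc/cpuinfo");
    std::string line;
    int logical = 0;
    while (std::getline(cpuinfo, line))
        if (line.compare(0, 9, "processor") == 0)
            ++logical;

    // _SC_NPROCESSORS_ONLN: processors currently online (what nproc prints).
    long online = sysconf(_SC_NPROCESSORS_ONLN);

    std::printf("/proc/cpuinfo: %d logical processors\n", logical);
    std::printf("sysconf:       %ld online processors\n", online);
    // With SMT (hyper-threading), both counts exceed the physical core count.
    return 0;
}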

ARM CPU features

CPU Benchmark

Sysbench

Sysbench: a scriptable, cross-platform, multi-threaded database and system performance benchmark tool.

sysbench --test=cpu --cpu-max-prime=20000 --num-threads=4 run   # legacy 0.x syntax
sysbench cpu --cpu-max-prime=20000 --threads=4 run              # sysbench 1.0+ syntax

htop

  • htop - an interactive process viewer for Unix

  • htop explained - Explanation of everything you can see in htop/top on Linux

Stress Testing

https://www.tecmint.com/linux-cpu-load-stress-test-with-stress-ng-tool/

# CPU temperature in millidegrees Celsius
cat /sys/class/thermal/thermal_zone0/temp

# stress: 4 CPU workers, 4 I/O workers, 1 memory worker allocating 1 GB
stress --cpu 4 --io 4 --vm 1 --vm-bytes 1G

CPU Instructions & Intrinsics

Assembly

SIMD

Intel MMX & SSE

ARM NEON

Arm NEON technology is an advanced SIMD (single instruction multiple data) architecture extension for the Arm Cortex-A series and Cortex-R52 processors.

Compiler Options:

  • test ARM NEON

    gcc -dM -E -x c /dev/null | grep -i -E "(SIMD|NEON|ARM)"
    
  • Raspberry Pi 3 Model B

    • g++ options
      -std=c++11 -O3 -march=native -mfpu=neon-vfpv4 -mfloat-abi=softfp -ffast-math
      
    • for the compilation error "‘vfmaq_f32’ was not declared in this scope", add the option -mfpu=neon-vfpv4 to enable __ARM_FEATURE_FMA in arm_neon.h (see the sketch below)
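
As a concrete check that the FMA intrinsics are available, here is a minimal NEON sketch. On a 32-bit ARM target it needs the flags above; on AArch64, NEON is mandatory and no -mfpu option is needed.

// neon_fma.cpp - minimal ARM NEON fused multiply-add sketch
// 32-bit ARM: g++ -O3 -mfpu=neon-vfpv4 -mfloat-abi=softfp neon_fma.cpp
// AArch64:    g++ -O3 neon_fma.cpp
#include <arm_neon.h>
#include <cstdio>

int main() {
    float32x4_t a = vdupq_n_f32(1.0f);   // {1, 1, 1, 1}
    float32x4_t b = vdupq_n_f32(2.0f);   // {2, 2, 2, 2}
    float32x4_t c = vdupq_n_f32(3.0f);   // {3, 3, 3, 3}

    // vfmaq_f32(a, b, c) computes a + b * c per lane
    // (requires __ARM_FEATURE_FMA, hence -mfpu=neon-vfpv4 on 32-bit ARM).
    float32x4_t r = vfmaq_f32(a, b, c);  // {7, 7, 7, 7}

    float out[4];
    vst1q_f32(out, r);                   // store the vector back to memory
    std::printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}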

Reference Books:

  • NEON Programmer’s Guide
  • ARM® NEON Intrinsics Reference

Converter

OpenMP

The OpenMP API specification for parallel programming: an Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared-memory parallelism.

OpenMP is commonly used in two parallelization styles: parallelizing a serial program with simple fork/join constructs, or using the single program, multiple data (SPMD) pattern. A minimal fork/join example follows the CMake snippet below.

OpenMP in CMakeLists.txt:

find_package(OpenMP)
if (OPENMP_FOUND)
    set (CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}")
    set (CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")
    set (CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} ${OpenMP_EXE_LINKER_FLAGS}")
endif()
# CMake >= 3.9 alternative: target_link_libraries(<target> PRIVATE OpenMP::OpenMP_CXX)
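
To make the fork/join style concrete, a minimal sketch assuming only an OpenMP-enabled compiler (e.g. g++ -fopenmp):

// omp_sum.cpp - minimal OpenMP fork/join sketch
// Build: g++ -std=c++11 -fopenmp omp_sum.cpp -o omp_sum
#include <cstdio>
#include <omp.h>

int main() {
    const int n = 1 << 20;
    double sum = 0.0;

    // Fork a team of threads; each executes a chunk of the loop iterations,
    // and the reduction combines the per-thread partial sums at the join.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += 1.0 / (i + 1);

    std::printf("threads=%d sum=%f\n", omp_get_max_threads(), sum);
    return 0;
}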

OpenACC

OpenACC is a user-driven, directive-based, performance-portable parallel programming model designed for scientists and engineers interested in porting their codes to a wide variety of heterogeneous HPC hardware platforms and architectures with significantly less programming effort than required with a low-level model.
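
For flavor, a minimal directive-based sketch (saxpy), assuming an OpenACC-capable compiler such as nvc++ with -acc or GCC with -fopenacc; compilers without OpenACC support simply ignore the pragma:

// acc_saxpy.cpp - minimal OpenACC directive sketch (saxpy)
// Build (NVIDIA HPC SDK): nvc++ -acc acc_saxpy.cpp
// Build (GCC):            g++ -fopenacc acc_saxpy.cpp
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    float* xp = x.data();
    float* yp = y.data();

    // Offload (or parallelize) the loop; the data clauses tell the
    // compiler how to manage host/device copies of the arrays.
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (int i = 0; i < n; ++i)
        yp[i] = 2.0f * xp[i] + yp[i];

    std::printf("y[0] = %f\n", yp[0]);  // expect 4.0
    return 0;
}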

Intel TBB

Intel Threading Building Blocks (TBB) lets you easily write parallel C++ programs that take full advantage of multicore performance, that are portable and composable, and that have future-proof scalability.
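
A minimal sketch using tbb::parallel_for, assuming TBB is installed and linked with -ltbb:

// tbb_for.cpp - minimal Intel TBB parallel_for sketch
// Build: g++ -std=c++11 tbb_for.cpp -ltbb -o tbb_for
#include <cstdio>
#include <vector>
#include <tbb/parallel_for.h>

int main() {
    const size_t n = 1 << 20;
    std::vector<float> v(n, 1.0f);

    // The lambda is applied to every index in [0, n); TBB splits the
    // range across worker threads and balances the load automatically.
    tbb::parallel_for(size_t(0), n, [&](size_t i) {
        v[i] = v[i] * 2.0f + 1.0f;
    });

    std::printf("v[0] = %f\n", v[0]);  // expect 3.0
    return 0;
}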

Intel IPP

GPU

GPU Benchmark

watch -n 10 nvidia-smi       # refresh the GPU stats every 10 seconds

# on Android (Qualcomm Adreno GPUs expose utilization via sysfs)
watch -n 0.1 adb shell cat /sys/class/kgsl/kgsl-3d0/gpu_busy_percentage  # every 0.1 s

Platforms

Languages

OpenCL

OpenCL™ (Open Computing Language) is the open, royalty-free standard for cross-platform, parallel programming of diverse processors found in personal computers, servers, mobile devices and embedded platforms.

  • install OpenCL

    # prerequisites: Ubuntu 16.04 with an NVIDIA GPU and the NVIDIA driver installed
    sudo apt-get install nvidia-prime nvidia-modprobe nvidia-opencl-dev
    sudo ln -s /usr/lib/x86_64-linux-gnu/libOpenCL.so.1 /usr/local/lib/libOpenCL.so
    
  • build program

    g++ main.cpp -lOpenCL
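
To verify the installation, a minimal host-side sketch that enumerates OpenCL platforms, using only the standard OpenCL C API:

// cl_platforms.cpp - list OpenCL platforms (minimal sketch)
// Build: g++ -std=c++11 cl_platforms.cpp -lOpenCL -o cl_platforms
#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_uint count = 0;
    clGetPlatformIDs(0, nullptr, &count);          // query the platform count
    if (count == 0) { std::printf("no OpenCL platforms found\n"); return 1; }

    cl_platform_id platforms[16];
    clGetPlatformIDs(count, platforms, nullptr);   // fetch the platform ids

    for (cl_uint i = 0; i < count && i < 16; ++i) {
        char name[256] = {0};
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME,
                          sizeof(name), name, nullptr);
        std::printf("platform %u: %s\n", i, name);
    }
    return 0;
}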
    

CUDA

CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphics processing units (GPUs).

Thrust
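
Thrust is the C++ parallel algorithms library that ships with the CUDA toolkit. A minimal sketch, assuming a CUDA-capable device and nvcc:

// thrust_saxpy.cu - minimal Thrust sketch
// Build: nvcc thrust_saxpy.cu -o thrust_saxpy
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

int main() {
    const int n = 1 << 20;
    thrust::device_vector<float> x(n, 1.0f);  // data lives on the GPU
    thrust::device_vector<float> y(n, 2.0f);

    // y = x + y, computed element-wise on the device
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(),
                      thrust::plus<float>());

    float y0 = y[0];  // copies one element back to the host
    std::printf("y[0] = %f\n", y0);  // expect 3.0
    return 0;
}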

DSP

Software Optimization

Parallel computing and high performance computing are related but not identical concepts. Parallel computing means decomposing a computational task into multiple sub-tasks that execute simultaneously, to improve speed and efficiency. High performance computing means exploiting state-of-the-art computer hardware and software to achieve greater computing capability and performance.

In parallel computing, multiple processors, multiple compute nodes, or a distributed cluster can process sub-tasks at the same time. This can greatly improve speed, especially for large-scale data or complex computational tasks, and applies to fields such as scientific computing, image processing, machine learning, and big-data analytics.

High performance computing is concerned more with optimizing the hardware and software configuration of the computing system: multi-core processors, large memory, high-speed interconnects, parallel file systems, and other advanced technologies, together with parallel programming models and optimized algorithms. It typically targets applications with large-scale data or high computational complexity, such as weather forecasting, genomics, and financial modeling.

In short, both aim to improve computing efficiency and performance, but with different emphases: parallel computing stresses splitting one task into concurrently executing sub-tasks, while high performance computing stresses optimizing the computing system to deliver more capability. The two are complementary and are widely used together across many fields.