High Performance Computing Overview

This document surveys the fundamentals of parallel computing, covering heterogeneous computing, the MPI message-passing interface, concurrency and multi-processing, CPU information, ARM CPU features, CPU benchmarking, and stress testing. It also collects tools and tutorials for OpenMP, OpenACC, Intel TBB, GPU programming, and software optimization.


Overview

Heterogeneous Computing

MPI (Message Passing Interface)

  • MPI Forum: the standardization forum for MPI
  • Open MPI: Open Source High Performance Computing.
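
As a quick orientation, the classic MPI hello-world below shows the lifecycle every MPI program follows (initialize, query rank and size, finalize). This is a minimal sketch using only the standard MPI C API; it is typically built with mpic++ and launched with mpirun.

// hello_mpi.cpp - minimal MPI lifecycle sketch
// Build:  mpic++ hello_mpi.cpp -o hello_mpi
// Run:    mpirun -np 4 ./hello_mpi
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);                  // start the MPI runtime

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // this process's id
    MPI_Comm_size(MPI_COMM_WORLD, &size);    // total number of processes

    std::printf("hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                          // shut down the MPI runtime
    return 0;
}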

Concurrency

  • task switching
  • hardware concurrency

Multi-Processing

IPC

Multi-Threading

  • POSIX C pthread
  • boost::thread
  • C++11 std::thread (see the minimal sketch after this list)
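
For reference, a minimal C++11 std::thread sketch, assuming nothing beyond the standard library: it launches a few worker threads and joins them.

// threads.cpp - minimal C++11 std::thread sketch
// Build: g++ -std=c++11 -pthread threads.cpp -o threads
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    // hardware_concurrency() hints at the number of hardware threads
    // (it may return 0 if unknown); cf. nproc in the CPU section below.
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;  // fallback when the hint is unavailable

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back([i] { std::printf("worker %u running\n", i); });

    for (auto& t : workers) t.join();  // wait for all workers to finish
    return 0;
}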

CPU

CPU Info

Eight commands to check CPU information on Linux:

  • /proc/cpuinfo: the /proc/cpuinfo file contains details about the individual CPU cores.

  • lscpu: prints the CPU hardware details in a user-friendly format.

  • cpuid: fetches CPUID information about Intel and AMD x86 processors.

  • nproc: prints the number of processing units available; note that this is not always the same as the number of cores (see the sketch after this list).

  • dmidecode: displays information about the CPU, including the socket type, vendor name, and various flags.

  • hardinfo: produces a large report about many hardware parts by reading files from the /proc directory.

  • lshw -class processor: lshw by default shows information about various hardware parts; the -class option picks out a specific hardware part.

  • inxi: a script that uses other programs to generate a well-structured, easy-to-read report about various hardware components on the system.
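
To illustrate the nproc caveat programmatically, here is a small Linux-only C++ sketch (assuming /proc is mounted) that counts logical processors in /proc/cpuinfo and compares the result with the POSIX sysconf value:

// cpucount.cpp - count logical processors on Linux (sketch)
// Build: g++ -std=c++11 cpucount.cpp -o cpucount
#include <cstdio>
#include <fstream>
#include <string>
#include <unistd.h>   // sysconf

int main() {
    // Each logical processor has its own "processor : N" stanza.
    std::ifstream cpuinfo("/proc/cpuinfo");
    std::string line;
    int logical = 0;
    while (std::getline(cpuinfo, line))
        if (line.compare(0, 9, "processor") == 0)
            ++logical;

    // _SC_NPROCESSORS_ONLN: processors currently online (what nproc prints).
    long online = sysconf(_SC_NPROCESSORS_ONLN);

    std::printf("/proc/cpuinfo: %d logical processors\n", logical);
    std::printf("sysconf:       %ld online processors\n", online);
    // With SMT (hyper-threading), both counts exceed the physical core count.
    return 0;
}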

ARM CPU features

CPU Benchmark

Sysbench

Sysbench: a scriptable, cross-platform, multi-threaded database and system performance benchmark tool.

sysbench --test=cpu --cpu-max-prime=20000 --num-threads=4 run   # legacy 0.x syntax
sysbench cpu --cpu-max-prime=20000 --threads=4 run              # sysbench 1.0+ syntax

htop

  • htop - an interactive process viewer for Unix

  • htop explained - Explanation of everything you can see in htop/top on Linux

Stress Testing

https://www.tecmint.com/linux-cpu-load-stress-test-with-stress-ng-tool/

# CPU temperature in millidegrees Celsius
cat /sys/class/thermal/thermal_zone0/temp

# stress: 4 CPU workers, 4 I/O workers, 1 memory worker allocating 1 GB
stress --cpu 4 --io 4 --vm 1 --vm-bytes 1G

CPU Instructions & Intrinsics

Assembly

SIMD

Intel MMX & SSE

ARM NEON

Arm NEON technology is an advanced SIMD (single instruction multiple data) architecture extension for the Arm Cortex-A series and Cortex-R52 processors.

Compiler Options:

  • test ARM NEON

    gcc -dM -E -x c /dev/null | grep -i -E "(SIMD|NEON|ARM)"
    
  • Raspberry Pi 3 Model B

    • g++ options
      -std=c++11 -O3 -march=native -mfpu=neon-vfpv4 -mfloat-abi=softfp -ffast-math
      
    • for the compilation error "‘vfmaq_f32’ was not declared in this scope", add the option -mfpu=neon-vfpv4 to enable __ARM_FEATURE_FMA in arm_neon.h (see the sketch below)
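
As a concrete check that the FMA intrinsics are available, here is a minimal NEON sketch. On a 32-bit ARM target it needs the flags above; on AArch64, NEON is mandatory and no -mfpu option is needed.

// neon_fma.cpp - minimal ARM NEON fused multiply-add sketch
// 32-bit ARM: g++ -O3 -mfpu=neon-vfpv4 -mfloat-abi=softfp neon_fma.cpp
// AArch64:    g++ -O3 neon_fma.cpp
#include <arm_neon.h>
#include <cstdio>

int main() {
    float32x4_t a = vdupq_n_f32(1.0f);   // {1, 1, 1, 1}
    float32x4_t b = vdupq_n_f32(2.0f);   // {2, 2, 2, 2}
    float32x4_t c = vdupq_n_f32(3.0f);   // {3, 3, 3, 3}

    // vfmaq_f32(a, b, c) computes a + b * c per lane
    // (requires __ARM_FEATURE_FMA, hence -mfpu=neon-vfpv4 on 32-bit ARM).
    float32x4_t r = vfmaq_f32(a, b, c);  // {7, 7, 7, 7}

    float out[4];
    vst1q_f32(out, r);                   // store the vector back to memory
    std::printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}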

Reference Books:

  • NEON Programmer’s Guide
  • ARM® NEON Intrinsics Reference

Converter

OpenMP

The OpenMP API specification for parallel programming: an Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared-memory parallelism.

OpenMP is commonly used in two parallelization styles: parallelizing a serial program with simple fork/join constructs, or using the single program, multiple data (SPMD) pattern. A minimal fork/join example follows the CMake snippet below.

OpenMP in CMakeLists.txt:

find_package(OpenMP)
if (OPENMP_FOUND)
    set (CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}")
    set (CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")
    set (CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} ${OpenMP_EXE_LINKER_FLAGS}")
endif()
# CMake >= 3.9 alternative: target_link_libraries(<target> PRIVATE OpenMP::OpenMP_CXX)
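
To make the fork/join style concrete, a minimal sketch assuming only an OpenMP-enabled compiler (e.g. g++ -fopenmp):

// omp_sum.cpp - minimal OpenMP fork/join sketch
// Build: g++ -std=c++11 -fopenmp omp_sum.cpp -o omp_sum
#include <cstdio>
#include <omp.h>

int main() {
    const int n = 1 << 20;
    double sum = 0.0;

    // Fork a team of threads; each executes a chunk of the loop iterations,
    // and the reduction combines the per-thread partial sums at the join.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += 1.0 / (i + 1);

    std::printf("threads=%d sum=%f\n", omp_get_max_threads(), sum);
    return 0;
}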

OpenACC

OpenACC is a user-driven, directive-based, performance-portable parallel programming model designed for scientists and engineers interested in porting their codes to a wide variety of heterogeneous HPC hardware platforms and architectures with significantly less programming effort than required with a low-level model.
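
For flavor, a minimal directive-based sketch (saxpy), assuming an OpenACC-capable compiler such as nvc++ with -acc or GCC with -fopenacc; compilers without OpenACC support simply ignore the pragma:

// acc_saxpy.cpp - minimal OpenACC directive sketch (saxpy)
// Build (NVIDIA HPC SDK): nvc++ -acc acc_saxpy.cpp
// Build (GCC):            g++ -fopenacc acc_saxpy.cpp
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    float* xp = x.data();
    float* yp = y.data();

    // Offload (or parallelize) the loop; the data clauses tell the
    // compiler how to manage host/device copies of the arrays.
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (int i = 0; i < n; ++i)
        yp[i] = 2.0f * xp[i] + yp[i];

    std::printf("y[0] = %f\n", yp[0]);  // expect 4.0
    return 0;
}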

Intel TBB

Intel Threading Building Blocks (TBB) lets you easily write parallel C++ programs that take full advantage of multicore performance, that are portable and composable, and that have future-proof scalability.
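
A minimal sketch using tbb::parallel_for, assuming TBB is installed and linked with -ltbb:

// tbb_for.cpp - minimal Intel TBB parallel_for sketch
// Build: g++ -std=c++11 tbb_for.cpp -ltbb -o tbb_for
#include <cstdio>
#include <vector>
#include <tbb/parallel_for.h>

int main() {
    const size_t n = 1 << 20;
    std::vector<float> v(n, 1.0f);

    // The lambda is applied to every index in [0, n); TBB splits the
    // range across worker threads and balances the load automatically.
    tbb::parallel_for(size_t(0), n, [&](size_t i) {
        v[i] = v[i] * 2.0f + 1.0f;
    });

    std::printf("v[0] = %f\n", v[0]);  // expect 3.0
    return 0;
}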

Intel IPP

GPU

GPU Benchmark

watch -n 10 nvidia-smi       # refresh the GPU stats every 10 seconds

# on Android (Qualcomm Adreno GPUs expose utilization via sysfs)
watch -n 0.1 adb shell cat /sys/class/kgsl/kgsl-3d0/gpu_busy_percentage  # every 0.1 s

Platforms

Languages

OpenCL

OpenCL™ (Open Computing Language) is the open, royalty-free standard for cross-platform, parallel programming of diverse processors found in personal computers, servers, mobile devices and embedded platforms.

  • install OpenCL

    # prerequisites: Ubuntu 16.04 with an NVIDIA GPU and the NVIDIA driver installed
    sudo apt-get install nvidia-prime nvidia-modprobe nvidia-opencl-dev
    sudo ln -s /usr/lib/x86_64-linux-gnu/libOpenCL.so.1 /usr/local/lib/libOpenCL.so
    
  • build program

    g++ main.cpp -lOpenCL
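
To verify the installation, a minimal host-side sketch that enumerates OpenCL platforms, using only the standard OpenCL C API:

// cl_platforms.cpp - list OpenCL platforms (minimal sketch)
// Build: g++ -std=c++11 cl_platforms.cpp -lOpenCL -o cl_platforms
#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_uint count = 0;
    clGetPlatformIDs(0, nullptr, &count);          // query the platform count
    if (count == 0) { std::printf("no OpenCL platforms found\n"); return 1; }

    cl_platform_id platforms[16];
    clGetPlatformIDs(count, platforms, nullptr);   // fetch the platform ids

    for (cl_uint i = 0; i < count && i < 16; ++i) {
        char name[256] = {0};
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME,
                          sizeof(name), name, nullptr);
        std::printf("platform %u: %s\n", i, name);
    }
    return 0;
}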
    

CUDA

CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphics processing units (GPUs).

Thrust
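
Thrust is the C++ parallel algorithms library that ships with the CUDA toolkit. A minimal sketch, assuming a CUDA-capable device and nvcc:

// thrust_saxpy.cu - minimal Thrust sketch
// Build: nvcc thrust_saxpy.cu -o thrust_saxpy
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

int main() {
    const int n = 1 << 20;
    thrust::device_vector<float> x(n, 1.0f);  // data lives on the GPU
    thrust::device_vector<float> y(n, 2.0f);

    // y = x + y, computed element-wise on the device
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(),
                      thrust::plus<float>());

    float y0 = y[0];  // copies one element back to the host
    std::printf("y[0] = %f\n", y0);  // expect 3.0
    return 0;
}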

DSP

Software Optimization

Parallel computing and high performance computing are related but not identical concepts. Parallel computing means decomposing a computational task into multiple sub-tasks that execute simultaneously, to improve speed and efficiency. High performance computing means exploiting state-of-the-art computer hardware and software to achieve greater computing capability and performance.

In parallel computing, multiple processors, multiple compute nodes, or a distributed cluster can process sub-tasks at the same time. This can greatly improve speed, especially for large-scale data or complex computational tasks, and applies to fields such as scientific computing, image processing, machine learning, and big-data analytics.

High performance computing is concerned more with optimizing the hardware and software configuration of the computing system: multi-core processors, large memory, high-speed interconnects, parallel file systems, and other advanced technologies, together with parallel programming models and optimized algorithms. It typically targets applications with large-scale data or high computational complexity, such as weather forecasting, genomics, and financial modeling.

In short, both aim to improve computing efficiency and performance, but with different emphases: parallel computing stresses splitting one task into concurrently executing sub-tasks, while high performance computing stresses optimizing the computing system to deliver more capability. The two are complementary and are widely used together across many fields.