GPGPUSim Lab Guide

Wang121201

Article link: https://blog.csdn.net/Wang121201/article/details/135240201

Server configuration

x86, Ubuntu 20

GPGPUSim lab setup

Resources

CUDA: cudatoolkit and gpu computing sdk
GPGPUSim source code
Rodinia benchmarks; the code has been patched so that it compiles as-is

# download the CUDA toolkit and SDK installers
wget http://developer.download.nvidia.com/compute/cuda/4_0/toolkit/cudatoolkit_4.0.17_linux_64_ubuntu10.10.run
wget http://developer.download.nvidia.com/compute/cuda/4_0/sdk/gpucomputingsdk_4.0.17_linux.run
# gpgpusim-v3.2.zip and rodinia_3.1.zip are assumed to be in the working directory already
unzip gpgpusim-v3.2.zip && rm gpgpusim-v3.2.zip
unzip rodinia_3.1.zip && rm rodinia_3.1.zip

After these steps, the working directory looks like this:
/root
|-- gpgpusim-v3.2: GPGPUSim source code
|-- rodinia_3.1: Rodinia benchmarks
|-- cudatoolkit_4.0.17_linux_64_ubuntu10.10.run: cudatoolkit
|-- gpucomputingsdk_4.0.17_linux.run: gpu computing sdk
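
The layout above can be checked before moving on. The following is a minimal sketch; `check_layout` is a hypothetical helper name, not part of GPGPUSim or Rodinia, and it only tests that the four entries exist.

```shell
# check_layout DIR: print one "missing: ..." line for every expected entry
# that is absent from DIR; the exit status is non-zero if anything is missing.
check_layout() {
    layout_dir=$1
    layout_missing=0
    for layout_item in gpgpusim-v3.2 rodinia_3.1 \
            cudatoolkit_4.0.17_linux_64_ubuntu10.10.run \
            gpucomputingsdk_4.0.17_linux.run; do
        if [ ! -e "$layout_dir/$layout_item" ]; then
            echo "missing: $layout_dir/$layout_item"
            layout_missing=1
        fi
    done
    return "$layout_missing"
}
```

For the setup described here it would be called as `check_layout /root`.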

Environment setup

chmod +x cudatoolkit_4.0.17_linux_64_ubuntu10.10.run
./cudatoolkit_4.0.17_linux_64_ubuntu10.10.run
chmod +x gpucomputingsdk_4.0.17_linux.run
./gpucomputingsdk_4.0.17_linux.run
# add to env
echo 'export PATH=$PATH:/usr/local/cuda/bin' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib:/usr/local/cuda/lib64' >> ~/.bashrc
echo 'export CUDA_INSTALL_PATH=/usr/local/cuda' >> ~/.bashrc
source ~/.bashrc
# check cuda install
nvcc -V
# on success, nvcc prints the following:
# nvcc: NVIDIA (R) Cuda compiler driver
# Copyright (c) 2005-2011 NVIDIA Corporation
# Built on Thu_May_12_11:09:45_PDT_2011
# Cuda compilation tools, release 4.0, V0.2.1221

# install gcc 4.4 (x86)
# add the trusty apt sources, which provide gcc 4.4
add-apt-repository 'deb http://archive.ubuntu.com/ubuntu/ trusty main'
add-apt-repository 'deb http://archive.ubuntu.com/ubuntu/ trusty universe'
apt update
apt-get install gcc-4.4 g++-4.4
# set and choose the gcc, g++ version
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 150
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.4 100
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 150
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-4.4 100
# select version 4.4 for both gcc and g++
update-alternatives --config gcc
update-alternatives --config g++
# apt-get install build-essential xutils-dev bison zlib1g-dev flex libglu1-mesa-dev
# apt-get install doxygen graphviz
# apt-get install python-pmw python-ply python-numpy libpng-dev python-matplotlib
# apt-get install libxi-dev libxmu-dev freeglut3-dev
# install dependencies
apt-get install build-essential xutils-dev bison zlib1g-dev flex libglu1-mesa-dev doxygen graphviz python-pmw python-ply python-numpy libpng-dev libxi-dev libxmu-dev freeglut3-dev
# gpgpusim source code and make
cd gpgpusim-v3.2/
source setup_environment && make
# a successful build ends with:
# make[1]: Leaving directory '/root/gpgpusim-v3.2/cuobjdump_to_ptxplus'
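
Since the build depends on gcc 4.4 being the selected compiler, it can help to confirm the active version before running make. This is a minimal sketch; `gcc_check` is a hypothetical helper, and it only inspects a version string such as the one printed by `gcc -dumpversion`.

```shell
# gcc_check VERSION: print "ok" if VERSION is a 4.4 release, otherwise print
# "mismatch" and return a non-zero status.
gcc_check() {
    case "$1" in
        4.4|4.4.*) echo "ok" ;;
        *) echo "mismatch"; return 1 ;;
    esac
}

# Example use before building (not executed here):
#   gcc_check "$(gcc -dumpversion)" || echo "rerun: update-alternatives --config gcc"
```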

Compiling Rodinia

cd ~/gpgpusim-v3.2/ && source setup_environment
cd ~/rodinia_3.1
./compile_rodinia.sh
./compile_rodinia_nvidia.sh
# the compiled applications are placed under rodinia_3.1/bin/linux/cuda:
# backprop  bfs  dwt2d  gaussian  heartwall  hotspot  kmeans  lud  needle  pathfinder  streamcluster

Running a test simulation

cd bin/linux/cuda/
# copy GTX480 config files to current directory
cp ~/gpgpusim-v3.2/configs/GTX480/* ./
# execute bfs
./bfs ../../../data/bfs/graph4096.txt
# run "cat ../../../cuda/test_app/run" to see how to execute test_app

Inspecting the output

# redirect the output to a log file
./bfs ../../../data/bfs/graph4096.txt > bfs.log
# miss-rate information for the L2, L1 data and L1 instruction caches
grep 'L2_total_cache' bfs.log | tail -n 12
grep 'L1D_total_cache' bfs.log | tail -n 10
grep 'L1I_total_cache' bfs.log | tail -n 10
# IPC information
grep 'ipc' bfs.log | tail -n 2
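
When several runs are compared, it can be easier to pull out a single final value than to read the grep output. The sketch below assumes the log writes each statistic as a `NAME = VALUE` line and that names such as `gpu_tot_ipc` and `L1D_total_cache_miss_rate` appear in it; both are assumptions about the log format, and `last_stat` is a hypothetical helper.

```shell
# last_stat NAME FILE: print the value from the last "NAME = VALUE" line in FILE.
# Nothing is printed if NAME never appears.
last_stat() {
    awk -v name="$1" '
        $1 == name && $2 == "=" { value = $3 }
        END { if (value != "") print value }
    ' "$2"
}

# Example use (not executed here):
#   last_stat gpu_tot_ipc bfs.log
#   last_stat L1D_total_cache_miss_rate bfs.log
```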

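Configuration parameters

~/gpgpusim-v3.2/configs/ contains configurations for four GPU models: GTX480, QuadroFX5600, QuadroFX5800 and TeslaC2050. The tests here mainly use GTX480, whose gpgpusim.config holds the concrete parameters for that architecture. The lines that the experiments modify are: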

-gpgpu_cache:dl1  32:128:4,L:L:m:N:H,A:32:8,8
-gpgpu_shmem_size 49152
# The alternative configuration for fermi in case cudaFuncCachePreferL1 is selected
#-gpgpu_cache:dl1  64:128:6,L:L:m:N:H,A:32:8,8
#-gpgpu_shmem_size 16384
# 64 sets, each 128 bytes 8-way for each memory sub partition. This gives 768KB L2 cache
-gpgpu_cache:dl2 64:128:8,L:B:m:W:L,A:32:4,4:0,32
-gpgpu_cache:dl2_texture_only 0
-gpgpu_cache:il1 4:128:4,L:R:f:N:L,A:2:32,4
-gpgpu_tex_cache:l1 4:128:24,L:R:m:N:L,F:128:4,128:2
-gpgpu_const_cache:l1 64:64:2,L:R:f:N:L,A:2:32,4
# power model configs
# -power_simulation_enabled 1
# -gpuwattch_xml_file gpuwattch_gtx480.xml

The GPGPUSim manual describes the memory options as follows:

-gpgpu_perfect_mem <0=off (default), 1=on> Enable perfect memory mode (zero memory latency with no cache misses)
-gpgpu_tex_cache:l1 <nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>:<fifo_entry>
Texture cache (Read-Only) configuration. Evict policy: L = LRU, F = FIFO, R = Random.
-gpgpu_const_cache:l1 <nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>
Constant cache (Read-Only) configuration. Evict policy: L = LRU, F = FIFO, R = Random
-gpgpu_cache:il1 <nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>
Shader L1 instruction cache (for global and local memory) configuration. Evict policy: L = LRU, F = FIFO, R = Random
-gpgpu_cache:dl1 <nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq> -- set to "none" for no DL1 --
L1 data cache (for global and local memory) configuration. Evict policy: L = LRU, F = FIFO, R = Random
-gpgpu_cache:dl2 <nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>
Unified banked L2 data cache configuration. This specifies the configuration for the L2 cache bank in one of the memory partitions. The total L2 cache capacity = <nsets> x <bsize> x <assoc> x <# memory controller>.
-gpgpu_shmem_size <shared memory size, default=16kB> Size of shared memory per SIMT core (aka shader core)
-gpgpu_shmem_warp_parts Number of portions a warp is divided into for shared memory bank conflict check
-gpgpu_flush_cache <0=off (default), 1=on> Flush cache at the end of each kernel call
-gpgpu_local_mem_map Mapping from local memory space address to simulated GPU physical address space (default = enabled)
-gpgpu_num_reg_banks Number of register banks (default = 8)
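
The capacity formula quoted for the L2 cache, `<nsets> x <bsize> x <assoc>` times the number of banks, can be worked through on the command line. This is a minimal sketch; `cache_bytes` is a hypothetical helper, and the bank count of 12 for the GTX480 L2 is an assumption and is not read from the configuration file.

```shell
# cache_bytes GEOMETRY [BANKS]: GEOMETRY is "<nsets>:<bsize>:<assoc>";
# BANKS is the number of cache instances to sum over (defaults to 1).
cache_bytes() {
    echo "$1" | awk -F: -v banks="${2:-1}" '{ print $1 * $2 * $3 * banks }'
}

cache_bytes 32:128:4      # L1 data cache per core: 16384 bytes (16 KB)
cache_bytes 64:128:8 12   # L2 with an assumed 12 banks: 786432 bytes (768 KB)
```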

When changing the GPU model, remember to copy that model's configuration files into the working directory of the binary:
cp ~/gpgpusim-v3.2/configs/TeslaC2050/* ./

References

GPGPUSim 3.2 user manual
Fermi architecture whitepaper

Basic experiments

1. Under the GTX480 configuration, change the cache replacement policy (FIFO, LRU, Random) and inspect the relevant bfs output (e.g. miss rate, IPC)

# change this line
# replacement: LRU
-gpgpu_cache:dl1  32:128:4,L:L:m:N:H,A:32:8,8
# replacement: FIFO
-gpgpu_cache:dl1  32:128:4,F:L:m:N:H,A:32:8,8
# replacement: Random (an error occurs with this setting)
-gpgpu_cache:dl1  32:128:4,R:L:m:N:H,A:32:8,8
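
Editing the policy letter by hand for every run is error-prone, so the change can be scripted. The following is a minimal sketch under the assumption that the active `-gpgpu_cache:dl1` line has the shape shown above; `set_dl1_policy` is a hypothetical helper, and commented-out lines are left untouched.

```shell
# set_dl1_policy POLICY FILE: set the replacement-policy letter (L, F or R)
# on the active "-gpgpu_cache:dl1" line of FILE.
set_dl1_policy() {
    case "$1" in
        L|F|R) ;;
        *) echo "policy must be L, F or R" >&2; return 1 ;;
    esac
    # Match only lines that start with the option name, so "#-gpgpu_cache:dl1"
    # and the dl2 line are not modified.
    sed -E "s/^(-gpgpu_cache:dl1[[:space:]]+[0-9]+:[0-9]+:[0-9]+,)[LFR]:/\1$1:/" \
        "$2" > "$2.tmp" && mv "$2.tmp" "$2"
}

# Example use (not executed here):
#   set_dl1_policy F gpgpusim.config
#   ./bfs ../../../data/bfs/graph4096.txt > bfs_fifo.log
```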

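2. Under the GTX480 configuration, change the L1 data cache size and inspect the output of bfs or other programs. The Fermi architecture offers two L1 data cache configurations.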

# -gpgpu_cache:dl1  32:128:4,L:L:m:N:H,A:32:8,8
# -gpgpu_shmem_size 49152
# The alternative configuration for fermi in case cudaFuncCachePreferL1 is selected
-gpgpu_cache:dl1  64:128:6,L:L:m:N:H,A:32:8,8
-gpgpu_shmem_size 16384
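
The two configurations divide the same on-chip storage differently between the L1 data cache and shared memory, which a quick calculation makes visible. This is a minimal sketch; `onchip_total` is a hypothetical helper that adds the L1 capacity (`<nsets> x <bsize> x <assoc>`) to the shared memory size.

```shell
# onchip_total GEOMETRY SHMEM_BYTES: L1 data cache capacity plus shared
# memory size, in bytes. GEOMETRY is "<nsets>:<bsize>:<assoc>".
onchip_total() {
    echo "$1" | awk -F: -v shmem="$2" '{ print $1 * $2 * $3 + shmem }'
}

onchip_total 32:128:4 49152   # 16 KB L1 + 48 KB shared = 65536
onchip_total 64:128:6 16384   # 48 KB L1 + 16 KB shared = 65536
```

Both settings add up to 64 KB per core, so enlarging the L1 cache here reduces the shared memory available to each kernel.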

3. Under the GTX480 configuration, enable power simulation and inspect the power consumed by the program

# power model configs
-power_simulation_enabled 1
-gpuwattch_xml_file gpuwattch_gtx480.xml

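A log file named gpgpusim_power_report is generated in the current directory; inspect the power figures with:
cat gpgpusim_power_report__*.log | grep 'power' | tail -n 6
4. Under the TeslaC2050 configuration, inspect the output
cp ~/gpgpusim-v3.2/configs/TeslaC2050/* ./
5. Under the GTX480 configuration, consult the manual, change the configuration of one structure that interests you, and inspect the program output
6. Under any configuration, pick 2-3 other Rodinia executables and inspect their output.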

./gaussian -s 64
./pathfinder 10000 100 20
./hotspot 512 2 2 ../../../data/hotspot/temp_512 ../../../data/hotspot/power_512 output.out
./needle 512 10
./backprop 16384
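
To keep the output of each benchmark for later comparison, every run can be redirected to its own log file. This is a minimal sketch; `run_logged` is a hypothetical helper that names the log after the executable.

```shell
# run_logged CMD [ARGS...]: run CMD with its arguments and save its standard
# output to "<basename of CMD>.log" in the current directory.
run_logged() {
    "$@" > "$(basename "$1").log"
}

# Example use (not executed here):
#   run_logged ./gaussian -s 64        # writes gaussian.log
#   run_logged ./needle 512 10         # writes needle.log
```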

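Extended experiments (optional; gdb debugging is worth a try)

Configure and test gpgpusim on the ARM architecture
Test other benchmarks, such as tango, polybench and mars
Build and test Mosaic, a gpgpusim 4.0 version
Debug the shared library generated by gpgpusim with gdb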

Notes on the extended experiments

CUDA 4.0 is not available on ARM, so the setup cannot be completed there

# ARM apt sources; only gcc 4.8 is available for building the source code
# deb [arch=arm64] http://ports.ubuntu.com/ubuntu-ports focal main restricted universe multiverse
# deb [arch=arm64] http://ports.ubuntu.com/ubuntu-ports focal-updates main restricted universe multiverse
# deb [arch=arm64] http://ports.ubuntu.com/ubuntu-ports focal-security main restricted universe multiverse
# apt update
# apt-get install gcc-4.8 g++-4.8
# ARM does not support CUDA 4.0, so this does not work

The gpgpu-bench repository contains scripts that fix the errors that mars and other benchmarks hit when running under gpgpusim

# clone the patches for the benchmarks
git clone https://github.com/KU-CSArch/gpgpu-bench.git
# execute patch.sh to fix the benchmark source code
# the source code can be copied with scp from root@120.46.222.177:/root/source.orgfiles
# then follow the GitHub README

Testing Mosaic, which includes a TLB structure and implements multi-GPU simulation. Details are on the page SAFARI: Tools, Software, and Full Data Sets

apt install g++-4.8
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-4.8 100
update-alternatives --config g++
# choose version 4.8
cd ~
git clone https://github.com/CMU-SAFARI/Mosaic.git
# /root/NVIDIA_GPU_Computing_SDK
# then follow the GitHub README

Debugging with gdb to follow code execution

apt install gdb
gdb --args ./bfs ../../../data/bfs/graph4096.txt
b main
run
b cycle
i b
bt
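
The same commands can be kept in a file and replayed with gdb's `-x` option, so the session does not have to be retyped. This is a minimal sketch; `write_gdb_script` and the file name `bfs.gdb` are hypothetical, and the commands are the long forms of the ones typed above.

```shell
# write_gdb_script FILE: save the gdb commands from the session above to FILE.
write_gdb_script() {
    cat > "$1" <<'EOF'
break main
run
break cycle
info breakpoints
backtrace
EOF
}

# Example use (not executed here):
#   write_gdb_script bfs.gdb
#   gdb -x bfs.gdb --args ./bfs ../../../data/bfs/graph4096.txt
```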

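Debugging the shared library is mainly about following the function calls and analysing how the simulator executes. The basic gdb commands are listed below for reference: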

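Command            Abbreviation    Description
backtrace          bt, where       show the backtrace
break              b               set a breakpoint
continue           c, cont         continue execution
delete             d               delete breakpoints
finish                             run until the current function returns
info breakpoints                   show breakpoint information
next               n               execute the next line, stepping over function calls
print              p               display the value of an expression
run                r               run the program
step               s               execute one line at a time, stepping into functions
x                                  display memory contents
until              u               run until the specified line
Other commands
directory          dir             add a directory to the source search path
disable            dis             disable breakpoints
down               do              select the frame called by the currently selected frame
edit               e               edit a file or function
frame              f               select the stack frame to display
forward-search     fo              search forward in the source
generate-core-file gcore           generate a core dump
help               h               show help
info               i               show information
list               l               list a function or source lines
nexti              ni              execute the next instruction (assembly level), stepping over calls
print-object       po              display Objective-C object information
sharedlibrary      share           load shared library symbols
stepi              si              execute the next instruction (assembly level), stepping into calls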