Rodinia
Environment (nvidia-smi output):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... Off | 00000000:CB:00.0 Off | 0 |
| N/A 25C P0 53W / 400W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
1. b+tree
WG size of kernel 1 & 2 = 256
Selecting device 0
Input File: ../../data/b+tree/mil.txt
Command File: ../../data/b+tree/command.txt
Command Buffer:
j 6000 3000
k 10000
Getting input from file ../../data/b+tree/mil.txt...
Transforming data to a GPU suitable structure...
Tree transformation took 0.083823
Waiting for command
>
******command: j count=6000, rSize=6000
knodes_elem=7874, knodes_unit_mem=2068, knodes_mem=16283432
# of blocks = 6000, # of threads/block = 256 (ensure that device can handle)
Time spent in different stages of GPU_CUDA KERNEL:
0.079342000186 s, 95.179946899414 % : GPU: SET DEVICE / DRIVER INIT
0.000395000010 s, 0.473848342896 % : GPU MEM: ALO
0.003152000019 s, 3.781190156937 % : GPU MEM: COPY IN
0.000061999999 s, 0.074376203120 % : GPU: KERNEL
0.000048999998 s, 0.058781191707 % : GPU MEM: COPY OUT
0.000360000005 s, 0.431861817837 % : GPU MEM: FRE
Total time:
0.083360001445 s
> > > > > > > > > > > >
******command: k count=10000
records_elem=1000000, records_unit_mem=4, records_mem=4000000
knodes_elem=7874, knodes_unit_mem=2068, knodes_mem=16283432
# of blocks = 10000, # of threads/block = 256 (ensure that device can handle)
Time spent in different stages of GPU_CUDA KERNEL:
0.000007000000 s, 0.140731811523 % : GPU: SET DEVICE / DRIVER INIT
0.000514000014 s, 10.333735466003 % : GPU MEM: ALO
0.003879999975 s, 78.005630493164 % : GPU MEM: COPY IN
0.000059000002 s, 1.186168074608 % : GPU: KERNEL
0.000032000000 s, 0.643345415592 % : GPU MEM: COPY OUT
0.000482000003 s, 9.690389633179 % : GPU MEM: FRE
Total time:
0.004974000156 s
2. cfd
3. bfs:
Reading File
Read File
Copied Everything to GPU memory
Start traversing the tree
Kernel Executed 10 times
Result stored in result.txt
Init: 0.000000
MemAlloc: 0.000000
HtoD: 255.552017
Exec: 0.275000
DtoH: 0.281000
Close: 0.217000
Total: 266.488007
4. CFD:
Makefile: arch set to compute_80. Run script:
#!/bin/bash
# Define the CFD programs to run
CFD_PROGRAMS=("euler3d" "pre_euler3d" "momentum")
echo "There are three datasets:"
# Run every dataset with each CFD program
for CFD_PROGRAM in "${CFD_PROGRAMS[@]}"; do
echo "Running with $CFD_PROGRAM:"
./"$CFD_PROGRAM" ../../data/cfd/fvcorr.domn.097K
./"$CFD_PROGRAM" ../../data/cfd/fvcorr.domn.193K
./"$CFD_PROGRAM" ../../data/cfd/missile.domn.0.2M
echo "Done with $CFD_PROGRAM."
done
There are three datasets:
Running with euler3d:
WG size of kernel:initialize = 192, WG size of kernel:compute_step_factor = 192, WG size of kernel:compute_flux = 192, WG size of kernel:time_step = 192
Name: NVIDIA A100-SXM4-40GB
Starting...
7.84005e-05 seconds per iteration
Saving solution...
Saved solution...
Cleaning up...
Done...
WG size of kernel:initialize = 192, WG size of kernel:compute_step_factor = 192, WG size of kernel:compute_flux = 192, WG size of kernel:time_step = 192
Name: NVIDIA A100-SXM4-40GB
Starting...
0.000109652 seconds per iteration
Saving solution...
Saved solution...
Cleaning up...
Done...
WG size of kernel:initialize = 192, WG size of kernel:compute_step_factor = 192, WG size of kernel:compute_flux = 192, WG size of kernel:time_step = 192
Name: NVIDIA A100-SXM4-40GB
Starting...
0.000209228 seconds per iteration
Saving solution...
Saved solution...
Cleaning up...
Done...
Done with euler3d.
Running with pre_euler3d:
Name: NVIDIA A100-SXM4-40GB
Starting...
0.000141933 seconds per iteration
Saving solution...
Saved solution...
Cleaning up...
Done...
Name: NVIDIA A100-SXM4-40GB
Starting...
0.000265089 seconds per iteration
Saving solution...
Saved solution...
Cleaning up...
Done...
Name: NVIDIA A100-SXM4-40GB
Starting...
0.000335518 seconds per iteration
Saving solution...
Saved solution...
Cleaning up...
Done...
Done with pre_euler3d.
Running with momentum:
run: line 16: ./momentum: Permission denied
run: line 17: ./momentum: Permission denied
run: line 18: ./momentum: Permission denied
Done with momentum.
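The three "Permission denied" failures mean the shell found ./momentum but the file is missing its execute bit (a common result of copying a prebuilt binary, or of unpacking without permissions). A minimal sketch of the likely fix, assuming the binary actually exists in the cfd directory; if the file does not exist at all, the loop fails for a different reason and the cfd Makefile needs a closer look:

```shell
# Stand-in file for this sketch; in the real run this would be the
# momentum binary inside the cfd build directory.
touch momentum
chmod +x momentum                       # restore the execute bit
test -x momentum && echo "momentum is executable"
# prints: momentum is executable
```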
5. dwt2d
Makefile:
# NVCC Options
NVCCFLAGS += -arch sm_80 -keep
result:
Using device 0: NVIDIA A100-SXM4-40GB
Source file: 192.bmp
Dimensions: 192x192
Components count: 3
Bit depth: 8
DWT levels: 3
Forward transform: 1
9/7 transform: 0
Loading ipnput: ../../data/dwt2d/192.bmp
precteno 110592, inputsize 110592
*** 3 stages of 2D forward DWT:
sliding steps = 2 , gx = 3 , gy = 12
fdwt53Kernel in launchFDWT53Kernel has finished
sliding steps = 1 , gx = 2 , gy = 12
fdwt53Kernel in launchFDWT53Kernel has finished
sliding steps = 1 , gx = 1 , gy = 6
fdwt53Kernel in launchFDWT53Kernel has finished
*** 3 stages of 2D forward DWT:
sliding steps = 2 , gx = 3 , gy = 12
fdwt53Kernel in launchFDWT53Kernel has finished
sliding steps = 1 , gx = 2 , gy = 12
fdwt53Kernel in launchFDWT53Kernel has finished
sliding steps = 1 , gx = 1 , gy = 6
fdwt53Kernel in launchFDWT53Kernel has finished
*** 3 stages of 2D forward DWT:
sliding steps = 2 , gx = 3 , gy = 12
fdwt53Kernel in launchFDWT53Kernel has finished
sliding steps = 1 , gx = 2 , gy = 12
fdwt53Kernel in launchFDWT53Kernel has finished
sliding steps = 1 , gx = 1 , gy = 6
fdwt53Kernel in launchFDWT53Kernel has finished
Writing to 192.bmp.dwt.r (192 x 192)
Writing to 192.bmp.dwt.g (192 x 192)
Writing to 192.bmp.dwt.b (192 x 192)
Directory listing after the -keep build (the intermediate .ptx, .cubin, .fatbin, and .ii files are retained):
192.bmp.dwt.b common.ptx components.ptx dwt.h fdwt53.ptx main.cpp4.ii rdwt53.cpp4.ii rdwt97.cudafe1.gpu
192.bmp.dwt.g common.sm_80.cubin components.sm_80.cubin dwt_kernel.c.copy fdwt53.sm_80.cubin main.cu rdwt53.cudafe1.c rdwt97.cudafe1.stub.c
192.bmp.dwt.r components.cpp1.ii dwt2d dwt.module_id fdwt97.cpp1.ii main.cudafe1.c rdwt53.cudafe1.cpp rdwt97.fatbin
autorun.sh components.cpp4.ii dwt.cpp1.ii dwt.ptx fdwt97.cpp4.ii main.cudafe1.cpp rdwt53.cudafe1.gpu rdwt97.fatbin.c
common.cpp1.ii components.cu dwt.cpp4.ii dwt.sm_80.cubin fdwt97.cudafe1.c main.cudafe1.gpu rdwt53.cudafe1.stub.c rdwt97.module_id
common.cpp4.ii components.cudafe1.c dwt.cu fdwt53.cpp1.ii fdwt97.cudafe1.cpp main.cudafe1.stub.c rdwt53.fatbin rdwt97.ptx
common.cudafe1.c components.cudafe1.cpp dwt_cuda fdwt53.cpp4.ii fdwt97.cudafe1.gpu main.cu.o rdwt53.fatbin.c rdwt97.sm_80.cubin
common.cudafe1.cpp components.cudafe1.gpu dwt.cudafe1.c fdwt53.cudafe1.c fdwt97.cudafe1.stub.c main.fatbin rdwt53.module_id README
common.cudafe1.gpu components.cudafe1.stub.c dwt.cudafe1.cpp fdwt53.cudafe1.cpp fdwt97.fatbin main.fatbin.c rdwt53.ptx result.txt
common.cudafe1.stub.c components.cu.o dwt.cudafe1.gpu fdwt53.cudafe1.gpu fdwt97.fatbin.c main.module_id rdwt53.sm_80.cubin run.sh
common.fatbin components.fatbin dwt.cudafe1.stub.c fdwt53.cudafe1.stub.c fdwt97.module_id main.ptx rdwt97.cpp1.ii
common.fatbin.c components.fatbin.c dwt.cu.o fdwt53.fatbin fdwt97.ptx main.sm_80.cubin rdwt97.cpp4.ii
common.h components.h dwt.fatbin fdwt53.fatbin.c fdwt97.sm_80.cubin Makefile rdwt97.cudafe1.c
common.module_id components.module_id dwt.fatbin.c fdwt53.module_id main.cpp1.ii rdwt53.cpp1.ii rdwt97.cudafe1.cpp
Using device 0: NVIDIA A100-SXM4-40GB
Source file: rgb.bmp
Dimensions: 1024x1024
Components count: 3
Bit depth: 8
DWT levels: 3
Forward transform: 1
9/7 transform: 0
Loading ipnput: ../../data/dwt2d/rgb.bmp
precteno 3145728, inputsize 3145728
*** 3 stages of 2D forward DWT:
sliding steps = 9 , gx = 6 , gy = 15
fdwt53Kernel in launchFDWT53Kernel has finished
sliding steps = 5 , gx = 4 , gy = 13
fdwt53Kernel in launchFDWT53Kernel has finished
sliding steps = 3 , gx = 4 , gy = 11
fdwt53Kernel in launchFDWT53Kernel has finished
*** 3 stages of 2D forward DWT:
sliding steps = 9 , gx = 6 , gy = 15
fdwt53Kernel in launchFDWT53Kernel has finished
sliding steps = 5 , gx = 4 , gy = 13
fdwt53Kernel in launchFDWT53Kernel has finished
sliding steps = 3 , gx = 4 , gy = 11
fdwt53Kernel in launchFDWT53Kernel has finished
*** 3 stages of 2D forward DWT:
sliding steps = 9 , gx = 6 , gy = 15
fdwt53Kernel in launchFDWT53Kernel has finished
sliding steps = 5 , gx = 4 , gy = 13
fdwt53Kernel in launchFDWT53Kernel has finished
sliding steps = 3 , gx = 4 , gy = 11
fdwt53Kernel in launchFDWT53Kernel has finished
Writing to rgb.bmp.dwt.r (1024 x 1024)
Writing to rgb.bmp.dwt.g (1024 x 1024)
Writing to rgb.bmp.dwt.b (1024 x 1024)
Device used: NVIDIA A100-SXM4-40GB
Input images
- The first image is 192.bmp, 192x192, with 3 color channels (RGB) and 8-bit depth per channel.
- The second image is rgb.bmp, 1024x1024, likewise with 3 RGB channels at 8-bit depth.
DWT parameters
- DWT levels: 3
- Forward transform: enabled
- The 5/3 wavelet transform is selected (9/7 transform: 0 means the 5/3 transform is used rather than 9/7).
DWT analysis
Each image goes through 3 stages of 2D forward DWT. At every stage the program prints the sliding steps and the grid-partition parameters gx and gy:
- Sliding steps: how the image data is processed in blocks on the GPU during the DWT.
- gx and gy: the grid partition in the x and y directions at each stage; they determine how the CUDA kernel parallelizes the work on the GPU.
Taking 192.bmp as an example:
- Stage 1: sliding steps = 2, 3 blocks in x, 12 blocks in y.
- Stage 2: sliding steps = 1, 2 blocks in x, 12 blocks in y.
- Stage 3: sliding steps = 1, 1 block in x, 6 blocks in y.
The same procedure applies to the larger rgb.bmp image, with different sliding steps and grid partitions at each level.
Kernel execution
Each DWT stage is executed by the fdwt53Kernel CUDA kernel; the repeated message fdwt53Kernel in launchFDWT53Kernel has finished marks the completion of each launch.
6. gaussian
Makefile:
release: $(SRC)
$(CC) $(KERNEL_DIM) $(SRC) -o $(EXE) -I$(INCLUDE) -L$(CUDA_LIB_DIR) -keep
clang: $(SRC)
clang++ $(SRC) -o $(EXE) -I../util --cuda-gpu-arch=sm_80 \
-L/usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread -DTIMING
res:
WG size of kernel 1 = 512, WG size of kernel 2= 4 X 4
Total Device found: 1
Device Name - NVIDIA A100-SXM4-40GB
Total Global Memory - 41485888 KB
Shared memory available per block - 48 KB
Number of registers per thread block - 65536
Warp size in threads - 32
Memory Pitch - 2147483647 bytes
Maximum threads per block - 1024
Maximum Thread Dimension (block) - 1024 1024 64
Maximum Thread Dimension (grid) - 2147483647 65535 65535
Total constant memory - 65536 bytes
CUDA ver - 8.0
Clock rate - 1410000 KHz
Texture Alignment - 512 bytes
Device Overlap - Allowed
Number of Multi processors - 108
Read file from ../../data/gaussian/matrix4.txt
Time total (including memory transfers) 0.161765 sec
Time for CUDA kernels: 0.000088 sec
WG size of kernel 1 = 512, WG size of kernel 2= 4 X 4
Total Device found: 1
Device Name - NVIDIA A100-SXM4-40GB
Total Global Memory - 41485888 KB
Shared memory available per block - 48 KB
Number of registers per thread block - 65536
Warp size in threads - 32
Memory Pitch - 2147483647 bytes
Maximum threads per block - 1024
Maximum Thread Dimension (block) - 1024 1024 64
Maximum Thread Dimension (grid) - 2147483647 65535 65535
Total constant memory - 65536 bytes
CUDA ver - 8.0
Clock rate - 1410000 KHz
Texture Alignment - 512 bytes
Device Overlap - Allowed
Number of Multi processors - 108
Create matrix internally in parse, size = 16
Time total (including memory transfers) 0.073124 sec
Time for CUDA kernels: 0.000326 sec
Device information
- Device name: NVIDIA A100-SXM4-40GB
- Total global memory: 41485888 KB (≈ 40 GB)
- Shared memory per block: 48 KB
- Registers per thread block: 65536
- Warp size: 32 threads
- Maximum threads per block: 1024
- Maximum thread dimensions (block): 1024x1024x64
- Maximum thread dimensions (grid): 2147483647x65535x65535
- Constant memory: 65536 bytes (64 KB)
- CUDA version: 8.0
- Clock rate: 1410000 KHz (1.41 GHz)
- Texture alignment: 512 bytes
- Number of multiprocessors: 108 (i.e. the GPU has 108 SMs, streaming multiprocessors)
Workgroup sizes
- Kernel 1: 512
- Kernel 2: 4x4 (i.e. 16 threads)
Execution times
- Total time (including memory transfers)
  - First run: 0.161765 s
  - Second run: 0.073124 s
- CUDA kernel time
  - First run: 0.000088 s
  - Second run: 0.000326 s
Overall, the kernel time is very short: the compute workload is small, and most of the time goes to data transfers and other non-compute work. The difference between the two runs is most likely due to the different inputs, or to caching effects.
7. heartwall
Makefile:
# compile main function file into object (binary)
main.o: main.cu kernel.cu define.c
nvcc $(OUTPUT) $(KERNEL_DIM) main.cu -I./AVI -c -O3 -keep
res:
WG size of kernel = 256
frame progress: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
8. hotspot
Makefile:
release: $(SRC)
$(CC) $(KERNEL_DIM) $(SRC) -o $(EXE) -I$(INCLUDE) -L$(CUDA_LIB_DIR) -keep
enum: $(SRC)
$(CC) $(KERNEL_DIM) -deviceemu $(SRC) -o $(EXE) -I$(INCLUDE) -L$(CUDA_LIB_DIR) -keep
debug: $(SRC)
$(CC) $(KERNEL_DIM) -g $(SRC) -o $(EXE) -I$(INCLUDE) -L$(CUDA_LIB_DIR) -keep
debugenum: $(SRC)
$(CC) $(KERNEL_DIM) -g -deviceemu $(SRC) -o $(EXE) -I$(INCLUDE) -L$(CUDA_LIB_DIR) -keep
res:
WG size of kernel = 16 X 16
pyramidHeight: 2
gridSize: [512, 512]
border:[2, 2]
blockGrid:[43, 43]
targetBlock:[12, 12]
Start computing the transient temperature
Ending simulation
9. hotspot3d
10. huffman
Makefile:
NVCC_OPTS=-O3 -arch=sm_80 -Xcompiler -m64 -g -G -keep
res:
CUDA initialized.
CUDA! Starting VLC Tests!
Parameters: num_elements: 262144, num_blocks: 1024, num_block_threads: 256
Time to generate: 0.3 ms
../../data/huffman/test1024_H2.206587175259.in, 1048576 bytes, entropy 2.206587
CPU Encoding time (CPU): 12.941000 (ms)
CPU Encoded to 291334 [B]
GPU Encoding time (SM64HUFF): 0.183190 (ms)
Num_blocks to be passed to scan is 1024.
Comparing vectors:
PASS! vectors are matching!
11. hybridsort
12. kmeans
gcc -g -fopenmp -O2 cluster.o getopt.o kmeans.o kmeans_clustering.o kmeans_cuda.o rmse.o -o kmeans -L/usr/local/cuda/lib64 -lcuda -lcudart -lm
gcc: error: kmeans.o: No such file or directory
Makefile:26: recipe for target 'kmeans' failed
make: *** [kmeans] Error 1
13. lavaMD
Makefile:
# OMP_FLAG = -Xcompiler paste_one_here
CUDA_FLAG = -arch sm_80 -keep
res:
thread block size of kernel = 128
Configuration used: boxes1d = 10
Time spent in different stages of GPU_CUDA KERNEL:
0.411684006453 s, 98.182235717773 % : GPU: SET DEVICE / DRIVER INIT
0.000730999978 s, 0.174335688353 % : GPU MEM: ALO
0.001612000051 s, 0.384444773197 % : GPU MEM: COPY IN
0.003311000066 s, 0.789638161659 % : GPU: KERNEL
0.001444999943 s, 0.344617068768 % : GPU MEM: COPY OUT
0.000523000024 s, 0.124729916453 % : GPU MEM: FRE
Total time:
0.419306010008 s
14. leukocyte
./meschach_lib/meschach.a -L/usr/local/cuda/lib64 -lm -lcuda -lcudart
gcc: error: avilib.o: No such file or directory
gcc: error: find_ellipse.o: No such file or directory
gcc: error: track_ellipse.o: No such file or directory
Makefile:32: recipe for target 'leukocyte' failed
make[1]: *** [leukocyte] Error 1
make[1]: Leaving directory '/home/u200810220/cuda/rodinia/gpu-rodinia/cuda/leukocyte/CUDA'
Makefile:4: recipe for target 'CUDA/leukocyte' failed
make: *** [CUDA/leukocyte] Error 2
15. lud
CUDA Makefile:
NVCC = nvcc -keep
DEFS += \
-DGPU_TIMER \
$(SPACE)
NVCCFLAGS += -I../common \
-O3 \
-use_fast_math \
-arch=sm_80 \
-lm \
$(SPACE)
res:
WG size of kernel = 16 X 16
Generate input matrix internally, size =256
Creating matrix internally size=256
Before LUD
Time consumed(ms): 0.877000
After LUD
16. mummergpu
suffix-tree.cpp:1764:26: error: ‘read’ was not declared in this scope
while ((bytes_read = read(qfile, buf, sizeof(buf))) != 0)
^~~~
suffix-tree.cpp:1764:26: note: suggested alternative: ‘fread’
while ((bytes_read = read(qfile, buf, sizeof(buf))) != 0)
^~~~
fread
suffix-tree.cpp:1807:34: error: ‘lseek’ was not declared in this scope
off_t seek = lseek(qfile, -(bytes_read - i), SEEK_CUR);
^~~~~
suffix-tree.cpp:1807:34: note: suggested alternative: ‘seek’
off_t seek = lseek(qfile, -(bytes_read - i), SEEK_CUR);
^~~~~
seek
suffix-tree.cpp:1715:33: warning: unused parameter ‘rc’ [-Wunused-parameter]
bool rc)
^~
suffix-tree.cpp: In function ‘int addMatchToBuffer(int, int, int)’:
suffix-tree.cpp:2180:1: warning: no return statement in function returning non-void [-Wreturn-type]
}
^
suffix-tree.cpp: In function ‘void printAlignments(ReferencePage*, Alignment*, char*, int, TextureAddress, int, int, int, bool, bool)’:
suffix-tree.cpp:2491:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if (printParentId == matchNodeId)
~~~~~~~~~~~~~~^~~~~~~~~~~~~~
suffix-tree.cpp:2631:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if (cid == matchNodeId)
~~~~^~~~~~~~~~~~~~
Makefile:114: recipe for target 'obj/release/suffix-tree.cpp_o' failed
make: *** [obj/release/suffix-tree.cpp_o] Error 1
17. myocyte
Makefile:
# link objects(binaries) together
myocyte.out: main.o
nvcc main.o \
-I/usr/local/cuda/include \
-L/usr/local/cuda/lib \
-lm -lcuda -lcudart \
-keep -o myocyte.out
res:
Time spent in different stages of the application:
0.000000000000 s, 0.000000000000 % : SETUP VARIABLES
1.817788958549 s, 84.736045837402 % : ALLOCATE CPU MEMORY AND GPU MEMORY
0.032156001776 s, 1.498948574066 % : READ DATA FROM FILES
0.295287013054 s, 13.764773368835 % : RUN COMPUTATION
0.000005000000 s, 0.000233074476 % : FREE MEMORY
Total time:
2.145236968994 s
18. nn
LOCAL_CC = gcc -g -O3 -Wall
CC := $(CUDA_DIR)/bin/nvcc -keep
res:
1988 12 27 0 18 TONY 30.0 89.8 113 39 --> Distance=0.199997
1980 10 22 18 3 ISAAC 30.1 90.4 110 778 --> Distance=0.412312
1997 11 14 12 24 HELENE 30.5 89.8 134 529 --> Distance=0.538515
2003 8 27 12 10 TONY 29.9 89.4 160 286 --> Distance=0.608275
1974 12 22 18 24 JOYCE 30.6 89.9 80 593 --> Distance=0.608276
19. nw
release: $(SRC)
$(CC) ${KERNEL_DIM} $(SRC) -o $(EXE) -I$(INCLUDE) -L$(CUDA_LIB_DIR) -keep
clang: $(SRC)
clang++ $(SRC) -o $(EXE) -I../util --cuda-gpu-arch=sm_80 \
-L/usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread -DTIMING -keep
enum: $(SRC)
$(CC) ${KERNEL_DIM} -deviceemu $(SRC) -o $(EXE) -I$(INCLUDE) -L$(CUDA_LIB_DIR) -keep
debug: $(SRC)
$(CC) ${KERNEL_DIM} -g $(SRC) -o $(EXE) -I$(INCLUDE) -L$(CUDA_LIB_DIR) -keep
debugenum: $(SRC)
$(CC) ${KERNEL_DIM} -g -deviceemu $(SRC) -o $(EXE) -I$(INCLUDE) -L$(CUDA_LIB_DIR) -keep
clean: $(SRC)
rm -f $(EXE) $(EXE).linkinfo result.txt
20. particlefilter
Makefile:
include ../../common/make.config
CC := $(CUDA_DIR)/bin/nvcc -keep
INCLUDE := $(CUDA_DIR)/include
all: naive float
naive: ex_particle_CUDA_naive_seq.cu
$(CC) -I$(INCLUDE) -L$(CUDA_LIB_DIR) -lcuda -g -lm -O3 -use_fast_math -arch sm_80 ex_particle_CUDA_naive_seq.cu -o particlefilter_naive
float: ex_particle_CUDA_float_seq.cu
$(CC) -I$(INCLUDE) -L$(CUDA_LIB_DIR) -lcuda -g -lm -O3 -use_fast_math -arch sm_80 ex_particle_CUDA_float_seq.cu -o particlefilter_float
clean:
rm particlefilter_naive particlefilter_float
res:
VIDEO SEQUENCE TOOK 0.034583
TIME TO GET NEIGHBORS TOOK: 0.000005
TIME TO GET WEIGHTSTOOK: 0.000005
TIME TO SET ERROR TOOK: 0.000149
TIME TO GET LIKELIHOODS TOOK: 0.000205
TIME TO GET EXP TOOK: 0.000021
TIME TO SUM WEIGHTS TOOK: 0.000002
TIME TO NORMALIZE WEIGHTS TOOK: 0.000001
TIME TO MOVE OBJECT TOOK: 0.000005
XE: 64.432825
YE: 64.430852
0.610713
TIME TO CALC CUM SUM TOOK: 0.000008
TIME TO CALC U TOOK: 0.000005
SENDING TO GPU TOOK: 0.000069
CUDA EXEC TOOK: 0.000125
SENDING BACK FROM GPU TOOK: 0.000033
TIME TO CALC NEW ARRAY X AND Y TOOK: 0.000233
TIME TO RESET WEIGHTS TOOK: 0.000003
TIME TO SET ERROR TOOK: 0.000138
TIME TO GET LIKELIHOODS TOOK: 0.000197
TIME TO GET EXP TOOK: 0.000015
TIME TO SUM WEIGHTS TOOK: 0.000002
TIME TO NORMALIZE WEIGHTS TOOK: 0.000002
TIME TO MOVE OBJECT TOOK: 0.000001
XE: 62.365991
YE: 65.436600
2.175731
TIME TO CALC CUM SUM TOOK: 0.000005
TIME TO CALC U TOOK: 0.000002
SENDING TO GPU TOOK: 0.000056
CUDA EXEC TOOK: 0.000120
SENDING BACK FROM GPU TOOK: 0.000024
TIME TO CALC NEW ARRAY X AND Y TOOK: 0.000205
TIME TO RESET WEIGHTS TOOK: 0.000002
TIME TO SET ERROR TOOK: 0.000140
TIME TO GET LIKELIHOODS TOOK: 0.000185
TIME TO GET EXP TOOK: 0.000014
TIME TO SUM WEIGHTS TOOK: 0.000002
TIME TO NORMALIZE WEIGHTS TOOK: 0.000001
TIME TO MOVE OBJECT TOOK: 0.000002
XE: 60.497261
YE: 66.539564
4.326496
TIME TO CALC CUM SUM TOOK: 0.000005
TIME TO CALC U TOOK: 0.000001
SENDING TO GPU TOOK: 0.000056
CUDA EXEC TOOK: 0.000118
SENDING BACK FROM GPU TOOK: 0.000023
TIME TO CALC NEW ARRAY X AND Y TOOK: 0.000203
TIME TO RESET WEIGHTS TOOK: 0.000002
TIME TO SET ERROR TOOK: 0.000144
TIME TO GET LIKELIHOODS TOOK: 0.000184
TIME TO GET EXP TOOK: 0.000015
TIME TO SUM WEIGHTS TOOK: 0.000003
TIME TO NORMALIZE WEIGHTS TOOK: 0.000001
TIME TO MOVE OBJECT TOOK: 0.000005
XE: 58.636936
YE: 67.376260
6.337317
TIME TO CALC CUM SUM TOOK: 0.000007
TIME TO CALC U TOOK: 0.000002
SENDING TO GPU TOOK: 0.000055
CUDA EXEC TOOK: 0.000034
SENDING BACK FROM GPU TOOK: 0.000024
TIME TO CALC NEW ARRAY X AND Y TOOK: 0.000120
TIME TO RESET WEIGHTS TOOK: 0.000002
TIME TO SET ERROR TOOK: 0.000135
TIME TO GET LIKELIHOODS TOOK: 0.000184
TIME TO GET EXP TOOK: 0.000014
TIME TO SUM WEIGHTS TOOK: 0.000002
TIME TO NORMALIZE WEIGHTS TOOK: 0.000002
TIME TO MOVE OBJECT TOOK: 0.000001
XE: 56.268162
YE: 68.101485
8.752343
TIME TO CALC CUM SUM TOOK: 0.000004
TIME TO CALC U TOOK: 0.000004
SENDING TO GPU TOOK: 0.000054
CUDA EXEC TOOK: 0.000128
SENDING BACK FROM GPU TOOK: 0.000022
TIME TO CALC NEW ARRAY X AND Y TOOK: 0.000210
TIME TO RESET WEIGHTS TOOK: 0.000002
TIME TO SET ERROR TOOK: 0.000142
TIME TO GET LIKELIHOODS TOOK: 0.000185
TIME TO GET EXP TOOK: 0.000015
TIME TO SUM WEIGHTS TOOK: 0.000004
TIME TO NORMALIZE WEIGHTS TOOK: 0.000001
TIME TO MOVE OBJECT TOOK: 0.000002
XE: 54.499444
YE: 69.650363
11.053830
TIME TO CALC CUM SUM TOOK: 0.000008
TIME TO CALC U TOOK: 0.000002
SENDING TO GPU TOOK: 0.000055
CUDA EXEC TOOK: 0.000141
SENDING BACK FROM GPU TOOK: 0.000023
TIME TO CALC NEW ARRAY X AND Y TOOK: 0.000226
TIME TO RESET WEIGHTS TOOK: 0.000002
TIME TO SET ERROR TOOK: 0.000139
TIME TO GET LIKELIHOODS TOOK: 0.000184
TIME TO GET EXP TOOK: 0.000015
TIME TO SUM WEIGHTS TOOK: 0.000003
TIME TO NORMALIZE WEIGHTS TOOK: 0.000001
TIME TO MOVE OBJECT TOOK: 0.000002
XE: 52.481617
YE: 70.550595
13.250791
TIME TO CALC CUM SUM TOOK: 0.000008
TIME TO CALC U TOOK: 0.000003
SENDING TO GPU TOOK: 0.000055
CUDA EXEC TOOK: 0.000139
SENDING BACK FROM GPU TOOK: 0.000022
TIME TO CALC NEW ARRAY X AND Y TOOK: 0.000223
TIME TO RESET WEIGHTS TOOK: 0.000002
TIME TO SET ERROR TOOK: 0.000139
TIME TO GET LIKELIHOODS TOOK: 0.000184
TIME TO GET EXP TOOK: 0.000015
TIME TO SUM WEIGHTS TOOK: 0.000003
TIME TO NORMALIZE WEIGHTS TOOK: 0.000002
TIME TO MOVE OBJECT TOOK: 0.000002
XE: 50.406399
YE: 71.369707
15.462813
TIME TO CALC CUM SUM TOOK: 0.000011
TIME TO CALC U TOOK: 0.000003
SENDING TO GPU TOOK: 0.000055
CUDA EXEC TOOK: 0.000088
SENDING BACK FROM GPU TOOK: 0.000023
TIME TO CALC NEW ARRAY X AND Y TOOK: 0.000175
TIME TO RESET WEIGHTS TOOK: 0.000002
TIME TO SET ERROR TOOK: 0.000139
TIME TO GET LIKELIHOODS TOOK: 0.000189
TIME TO GET EXP TOOK: 0.000016
TIME TO SUM WEIGHTS TOOK: 0.000003
TIME TO NORMALIZE WEIGHTS TOOK: 0.000002
TIME TO MOVE OBJECT TOOK: 0.000002
XE: 48.546236
YE: 72.165669
17.478471
TIME TO CALC CUM SUM TOOK: 0.000008
TIME TO CALC U TOOK: 0.000005
SENDING TO GPU TOOK: 0.000054
CUDA EXEC TOOK: 0.000142
SENDING BACK FROM GPU TOOK: 0.000023
TIME TO CALC NEW ARRAY X AND Y TOOK: 0.000225
TIME TO RESET WEIGHTS TOOK: 0.000002
PARTICLE FILTER TOOK 0.291389
ENTIRE PROGRAM TOOK 0.325972
VIDEO SEQUENCE TOOK 0.014288
TIME TO SEND TO GPU: 0.000087
GPU Execution: 0.004080
FREE TIME: 0.000026
TIME TO SEND BACK: 0.000080
SEND ARRAY X BACK: 0.000022
SEND ARRAY Y BACK: 0.000017
SEND WEIGHTS BACK: 0.000015
XE: 48.546236
YE: 72.165669
17.478471
PARTICLE FILTER TOOK 0.255126
ENTIRE PROGRAM TOOK 0.269414
- Video sequence processing: VIDEO SEQUENCE TOOK 0.034583, i.e. the video sequence is processed very quickly.
- Memory allocation: ALLOCATE CPU MEMORY AND GPU MEMORY accounts for a very large share of the time (this figure is from the myocyte run in section 17), showing that allocation is a major bottleneck.
- Computation: RUN COMPUTATION takes 0.295287013054 s (likewise from the myocyte run), the bulk of the compute stage.
- CUDA execution: CUDA EXEC TOOK is very short, showing the GPU computes efficiently.
- Position estimates: the XE and YE outputs give the particle filter's position estimate, which changes gradually as the iterations proceed.
21. pathfinder
CC := $(CUDA_DIR)/bin/nvcc -keep
INCLUDE := $(CUDA_DIR)/include
SRC = pathfinder.cu
EXE = pathfinder
release:
$(CC) $(SRC) -o $(EXE) -I$(INCLUDE) -L$(CUDA_LIB_DIR)
clang: $(SRC)
clang++ $(SRC) -o $(EXE) -I../util --cuda-gpu-arch=sm_80 \
-L/usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread -DTIMING
res:
pyramidHeight: 20
gridSize: [100000]
border:[20]
blockSize: 256
blockGrid:[463]
targetBlock:[216]
22. srad
srad1:
CC := $(CUDA_DIR)/bin/nvcc -keep
Time spent in different stages of the application:
0.000000000000 s, 0.000000000000 % : SETUP VARIABLES
0.000005000000 s, 0.000981340767 % : READ COMMAND LINE PARAMETERS
0.052648998797 s, 10.333321571350 % : READ IMAGE FROM FILE
0.000437999988 s, 0.085965454578 % : RESIZE IMAGE
0.431991994381 s, 84.786270141602 % : GPU DRIVER INIT, CPU/GPU SETUP, MEMORY ALLOCATION
0.000071000002 s, 0.013935038820 % : COPY DATA TO CPU->GPU
0.000022000000 s, 0.004317899700 % : EXTRACT IMAGE
0.007325000130 s, 1.437664270401 % : COMPUTE
0.000003000000 s, 0.000588804483 % : COMPRESS IMAGE
0.000486000004 s, 0.095386326313 % : COPY DATA TO GPU->CPU
0.015652999282 s, 3.072185516357 % : SAVE IMAGE INTO FILE
0.000862999994 s, 0.169379413128 % : FREE MEMORY
Total time:
0.509507000446 s
srad2:
(base) u200810220@n1:~/cuda/rodinia/gpu-rodinia/cuda/srad/srad_v2$ bash run
WG size of kernel = 16 X 16
Randomizing the input matrix
Start the SRAD main loop
Computation Done
23. streamcluster
NVCC = $(CUDA_DIR)/bin/nvcc -keep
# make dp=1 compiles the CUDA kernels with double-precision support
ifeq ($(dp),1)
NVCC_FLAGS += --gpu-name sm_80
endif
PARSEC Benchmark Suite
read 65536 points
finish local search
time = 5.873713s
time pgain = 0.000000s
time pgain_dist = 0.000000s
time pgain_init = 0.000000s
time pselect = 0.000300s
time pspeedy = 0.390567s
time pshuffle = 0.004491s
time localSearch = 5.743525s
====CUDA Timing info (pgain)====
time serial = 1.350207s
time CPU to GPU memory copy = 0.618665s
time GPU to CPU memory copy back = 0.695247s
time GPU malloc = 0.333729s
time GPU free = 1.709554s
time kernel = 0.121225s