Benchmark examples based on Rodinia

Article link: https://blog.csdn.net/2303_77224751/article/details/142796840

Rodinia

nvidia-smi output (driver and CUDA version):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:CB:00.0 Off |                    0 |
| N/A   25C    P0    53W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

1. b+tree

WG size of kernel 1 & 2  = 256 
Selecting device 0
Input File: ../../data/b+tree/mil.txt 
Command File: ../../data/b+tree/command.txt 
Command Buffer: 
j 6000 3000
k 10000


Getting input from file ../../data/b+tree/mil.txt...
Transforming data to a GPU suitable structure...
Tree transformation took 0.083823
Waiting for command
> 
******command: j count=6000, rSize=6000 
knodes_elem=7874, knodes_unit_mem=2068, knodes_mem=16283432
# of blocks = 6000, # of threads/block = 256 (ensure that device can handle)

Time spent in different stages of GPU_CUDA KERNEL:
 0.079342000186 s, 95.179946899414 % : GPU: SET DEVICE / DRIVER INIT
 0.000395000010 s,  0.473848342896 % : GPU MEM: ALO
 0.003152000019 s,  3.781190156937 % : GPU MEM: COPY IN
 0.000061999999 s,  0.074376203120 % : GPU: KERNEL
 0.000048999998 s,  0.058781191707 % : GPU MEM: COPY OUT
 0.000360000005 s,  0.431861817837 % : GPU MEM: FRE
Total time:
0.083360001445 s
> > > > > > > > > > > > 
 ******command: k count=10000 
records_elem=1000000, records_unit_mem=4, records_mem=4000000
knodes_elem=7874, knodes_unit_mem=2068, knodes_mem=16283432

# of blocks = 10000, # of threads/block = 256 (ensure that device can handle)
Time spent in different stages of GPU_CUDA KERNEL:
 0.000007000000 s,  0.140731811523 % : GPU: SET DEVICE / DRIVER INIT
 0.000514000014 s, 10.333735466003 % : GPU MEM: ALO
 0.003879999975 s, 78.005630493164 % : GPU MEM: COPY IN
 0.000059000002 s,  1.186168074608 % : GPU: KERNEL
 0.000032000000 s,  0.643345415592 % : GPU MEM: COPY OUT
 0.000482000003 s,  9.690389633179 % : GPU MEM: FRE
Total time:
0.004974000156 s
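
The percentages in the timing tables can be recomputed from the raw seconds. The sketch below is illustrative only: it hard-codes the six stage lines of the k command copied from the log above, and the helper name stage_shares is hypothetical, not part of Rodinia.

```python
import re

# Stage lines copied from the "k" command output above.
LOG = """\
 0.000007000000 s,  0.140731811523 % : GPU: SET DEVICE / DRIVER INIT
 0.000514000014 s, 10.333735466003 % : GPU MEM: ALO
 0.003879999975 s, 78.005630493164 % : GPU MEM: COPY IN
 0.000059000002 s,  1.186168074608 % : GPU: KERNEL
 0.000032000000 s,  0.643345415592 % : GPU MEM: COPY OUT
 0.000482000003 s,  9.690389633179 % : GPU MEM: FRE
"""

# One stage line: "<seconds> s, <percent> % : <stage name>"
LINE = re.compile(r"^\s*([\d.]+) s,\s*[\d.]+ % : (.+)$")


def stage_shares(log):
    """Return {stage name: percent of total time}, recomputed from the seconds."""
    seconds = {}
    for line in log.splitlines():
        match = LINE.match(line)
        if match:
            seconds[match.group(2).strip()] = float(match.group(1))
    total = sum(seconds.values())
    return {name: 100.0 * value / total for name, value in seconds.items()}


shares = stage_shares(LOG)
for name, pct in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{pct:6.2f} %  {name}")
```

For the k command the host-to-device copy dominates, while the kernel itself accounts for roughly 1 % of the measured time.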

2.cfd

3.bfs:

Reading File
Read File
Copied Everything to GPU memory
Start traversing the tree
Kernel Executed 10 times
Result stored in result.txt
Init: 0.000000
MemAlloc: 0.000000
HtoD: 255.552017
Exec: 0.275000
DtoH: 0.281000
Close: 0.217000
Total: 266.488007
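
The bfs stage timings do not add up to the printed total. The short check below, a sketch that only uses the numbers copied from the log above, shows how much time is unaccounted for and how large the HtoD share is. The log does not state the unit, so none is assumed here.

```python
# Stage timings copied from the bfs output above (unit as printed by the program).
stages = {
    "Init": 0.000000,
    "MemAlloc": 0.000000,
    "HtoD": 255.552017,
    "Exec": 0.275000,
    "DtoH": 0.281000,
    "Close": 0.217000,
}
total = 266.488007

accounted = sum(stages.values())          # time covered by the listed stages
unaccounted = total - accounted           # time outside any listed stage
htod_share = 100.0 * stages["HtoD"] / total

print(f"accounted:   {accounted:.6f}")
print(f"unaccounted: {unaccounted:.6f}")
print(f"HtoD share:  {htod_share:.1f} %")
```

With Init and MemAlloc both reported as zero, the large HtoD value may include one-time CUDA context creation. The log alone cannot confirm this.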

4.CFD:

makefile-> compute_80 :


#!/bin/bash

# Define the CFD programs to run

CFD_PROGRAMS=("euler3d" "pre_euler3d" "momentum")

echo "There are three datasets:"

# Run every dataset with each CFD program

for CFD_PROGRAM in "${CFD_PROGRAMS[@]}"; do
   echo "Running with $CFD_PROGRAM:"
   
   ./"$CFD_PROGRAM" ../../data/cfd/fvcorr.domn.097K
   ./"$CFD_PROGRAM" ../../data/cfd/fvcorr.domn.193K
   ./"$CFD_PROGRAM" ../../data/cfd/missile.domn.0.2M
   
   echo "Done with $CFD_PROGRAM."
done

There are three datasets:
Running with euler3d:
WG size of kernel:initialize = 192, WG size of kernel:compute_step_factor = 192, WG size of kernel:compute_flux = 192, WG size of kernel:time_step = 192
Name:                     NVIDIA A100-SXM4-40GB
Starting...
7.84005e-05 seconds per iteration
Saving solution...
Saved solution...
Cleaning up...
Done...
WG size of kernel:initialize = 192, WG size of kernel:compute_step_factor = 192, WG size of kernel:compute_flux = 192, WG size of kernel:time_step = 192
Name:                     NVIDIA A100-SXM4-40GB
Starting...
0.000109652 seconds per iteration
Saving solution...
Saved solution...
Cleaning up...
Done...
WG size of kernel:initialize = 192, WG size of kernel:compute_step_factor = 192, WG size of kernel:compute_flux = 192, WG size of kernel:time_step = 192
Name:                     NVIDIA A100-SXM4-40GB
Starting...
0.000209228 seconds per iteration
Saving solution...
Saved solution...
Cleaning up...
Done...
Done with euler3d.
Running with pre_euler3d:
Name:                     NVIDIA A100-SXM4-40GB
Starting...
0.000141933 seconds per iteration
Saving solution...
Saved solution...
Cleaning up...
Done...
Name:                     NVIDIA A100-SXM4-40GB
Starting...
0.000265089 seconds per iteration
Saving solution...
Saved solution...
Cleaning up...
Done...
Name:                     NVIDIA A100-SXM4-40GB
Starting...
0.000335518 seconds per iteration
Saving solution...
Saved solution...
Cleaning up...
Done...
Done with pre_euler3d.
Running with momentum:
run: line 16: ./momentum: Permission denied
run: line 17: ./momentum: Permission denied
run: line 18: ./momentum: Permission denied
Done with momentum.
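
The three "Permission denied" lines mean that a file named momentum exists but the shell is not allowed to execute it. Whether momentum is meant to be an executable at all is not clear from the log. One way to make such a run script more robust is to check each program before launching it. The sketch below is an assumption-laden illustration in Python rather than a replacement for the bash script: the function names is_runnable and run_all are hypothetical, and the program and dataset names are copied from the script above.

```python
import os
import subprocess

CFD_PROGRAMS = ["euler3d", "pre_euler3d", "momentum"]
DATASETS = [
    "../../data/cfd/fvcorr.domn.097K",
    "../../data/cfd/fvcorr.domn.193K",
    "../../data/cfd/missile.domn.0.2M",
]


def is_runnable(path):
    """True if path is a regular file that the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


def run_all(programs, datasets, workdir="."):
    """Run each executable on each dataset; return the names that were skipped."""
    skipped = []
    for name in programs:
        # Absolute path, so the lookup does not depend on the cwd change below.
        path = os.path.abspath(os.path.join(workdir, name))
        if not is_runnable(path):
            skipped.append(name)
            continue
        for dataset in datasets:
            subprocess.run([path, dataset], cwd=workdir, check=False)
    return skipped


if __name__ == "__main__":
    for name in run_all(CFD_PROGRAMS, DATASETS):
        print(f"skipped {name}: missing or not executable")
```

If momentum really is a built binary, setting its execute bit (chmod +x momentum) would remove the error. If it is a source or data file, it should simply be dropped from the program list.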

5. dwt2d

Makefile-> :

# NVCC Options
NVCCFLAGS += -arch sm_80 -keep

result:

Using device 0: NVIDIA A100-SXM4-40GB
Source file:            192.bmp
 Dimensions:            192x192
 Components count:      3
 Bit depth:             8
 DWT levels:            3
 Forward transform:     1
 9/7 transform:         0
Loading ipnput: ../../data/dwt2d/192.bmp
precteno 110592, inputsize 110592

*** 3 stages of 2D forward DWT:

 sliding steps = 2 , gx = 3 , gy = 12 
fdwt53Kernel in launchFDWT53Kernel has finished
 sliding steps = 1 , gx = 2 , gy = 12 
fdwt53Kernel in launchFDWT53Kernel has finished
 sliding steps = 1 , gx = 1 , gy = 6 
fdwt53Kernel in launchFDWT53Kernel has finished
*** 3 stages of 2D forward DWT:

 sliding steps = 2 , gx = 3 , gy = 12 
fdwt53Kernel in launchFDWT53Kernel has finished
 sliding steps = 1 , gx = 2 , gy = 12 
fdwt53Kernel in launchFDWT53Kernel has finished
 sliding steps = 1 , gx = 1 , gy = 6 
fdwt53Kernel in launchFDWT53Kernel has finished
*** 3 stages of 2D forward DWT:

 sliding steps = 2 , gx = 3 , gy = 12 
fdwt53Kernel in launchFDWT53Kernel has finished
 sliding steps = 1 , gx = 2 , gy = 12 
fdwt53Kernel in launchFDWT53Kernel has finished
 sliding steps = 1 , gx = 1 , gy = 6 
fdwt53Kernel in launchFDWT53Kernel has finished
Writing to 192.bmp.dwt.r (192 x 192)

Writing to 192.bmp.dwt.g (192 x 192)

Writing to 192.bmp.dwt.b (192 x 192)
192.bmp.dwt.b          common.ptx                 components.ptx          dwt.h               fdwt53.ptx             main.cpp4.ii         rdwt53.cpp4.ii         rdwt97.cudafe1.gpu
192.bmp.dwt.g          common.sm_80.cubin         components.sm_80.cubin  dwt_kernel.c.copy      fdwt53.sm_80.cubin     main.cu              rdwt53.cudafe1.c       rdwt97.cudafe1.stub.c
192.bmp.dwt.r          components.cpp1.ii         dwt2d                   dwt.module_id          fdwt97.cpp1.ii         main.cudafe1.c       rdwt53.cudafe1.cpp     rdwt97.fatbin
autorun.sh             components.cpp4.ii         dwt.cpp1.ii             dwt.ptx               fdwt97.cpp4.ii         main.cudafe1.cpp     rdwt53.cudafe1.gpu     rdwt97.fatbin.c
common.cpp1.ii         components.cu              dwt.cpp4.ii             dwt.sm_80.cubin        fdwt97.cudafe1.c       main.cudafe1.gpu     rdwt53.cudafe1.stub.c  rdwt97.module_id
common.cpp4.ii         components.cudafe1.c       dwt.cu                  fdwt53.cpp1.ii         fdwt97.cudafe1.cpp     main.cudafe1.stub.c  rdwt53.fatbin          rdwt97.ptx
common.cudafe1.c       components.cudafe1.cpp     dwt_cuda                fdwt53.cpp4.ii         fdwt97.cudafe1.gpu     main.cu.o            rdwt53.fatbin.c        rdwt97.sm_80.cubin
common.cudafe1.cpp     components.cudafe1.gpu     dwt.cudafe1.c           fdwt53.cudafe1.c       fdwt97.cudafe1.stub.c  main.fatbin          rdwt53.module_id       README
common.cudafe1.gpu     components.cudafe1.stub.c  dwt.cudafe1.cpp         fdwt53.cudafe1.cpp     fdwt97.fatbin          main.fatbin.c        rdwt53.ptx             result.txt
common.cudafe1.stub.c  components.cu.o            dwt.cudafe1.gpu         fdwt53.cudafe1.gpu     fdwt97.fatbin.c        main.module_id       rdwt53.sm_80.cubin     run.sh
common.fatbin          components.fatbin          dwt.cudafe1.stub.c      fdwt53.cudafe1.stub.c  fdwt97.module_id       main.ptx             rdwt97.cpp1.ii
common.fatbin.c        components.fatbin.c        dwt.cu.o                fdwt53.fatbin          fdwt97.ptx             main.sm_80.cubin     rdwt97.cpp4.ii
common.h               components.h               dwt.fatbin              fdwt53.fatbin.c        fdwt97.sm_80.cubin     Makefile             rdwt97.cudafe1.c
common.module_id       components.module_id       dwt.fatbin.c            fdwt53.module_id       main.cpp1.ii           rdwt53.cpp1.ii       rdwt97.cudafe1.cpp
Using device 0: NVIDIA A100-SXM4-40GB
Source file:            rgb.bmp
 Dimensions:            1024x1024
 Components count:      3
 Bit depth:             8
 DWT levels:            3
 Forward transform:     1
 9/7 transform:         0
Loading ipnput: ../../data/dwt2d/rgb.bmp
precteno 3145728, inputsize 3145728

*** 3 stages of 2D forward DWT:

 sliding steps = 9 , gx = 6 , gy = 15 
fdwt53Kernel in launchFDWT53Kernel has finished
 sliding steps = 5 , gx = 4 , gy = 13 
fdwt53Kernel in launchFDWT53Kernel has finished
 sliding steps = 3 , gx = 4 , gy = 11 
fdwt53Kernel in launchFDWT53Kernel has finished
*** 3 stages of 2D forward DWT:

 sliding steps = 9 , gx = 6 , gy = 15 
fdwt53Kernel in launchFDWT53Kernel has finished
 sliding steps = 5 , gx = 4 , gy = 13 
fdwt53Kernel in launchFDWT53Kernel has finished
 sliding steps = 3 , gx = 4 , gy = 11 
fdwt53Kernel in launchFDWT53Kernel has finished
*** 3 stages of 2D forward DWT:

 sliding steps = 9 , gx = 6 , gy = 15 
fdwt53Kernel in launchFDWT53Kernel has finished
 sliding steps = 5 , gx = 4 , gy = 13 
fdwt53Kernel in launchFDWT53Kernel has finished
 sliding steps = 3 , gx = 4 , gy = 11 
fdwt53Kernel in launchFDWT53Kernel has finished
Writing to rgb.bmp.dwt.r (1024 x 1024)

Writing to rgb.bmp.dwt.g (1024 x 1024)

Writing to rgb.bmp.dwt.b (1024 x 1024)

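Device used: NVIDIA A100-SXM4-40GB

Input images

The first image is 192.bmp: 192x192 pixels, 3 color channels (RGB), 8 bits per channel.
The second image is rgb.bmp: 1024x1024 pixels, also 3 color channels (RGB) at 8 bits per channel.

DWT parameters

DWT levels: 3
Forward transform: enabled
Wavelet: 5/3 (9/7 transform: 0 means the 5/3 transform is used instead of the 9/7 transform)

DWT analysis

Each image goes through 3 stages of the 2D forward DWT. For each stage the program prints the sliding steps and the grid dimensions gx and gy:

sliding steps: describes how the image data is tiled and processed on the GPU during the DWT.
gx and gy: the grid dimensions in the x and y directions for each stage. They determine how the CUDA kernel is parallelized across the GPU.

Taking 192.bmp as an example:

Stage 1: sliding steps = 2, 3 blocks in x, 12 blocks in y.
Stage 2: sliding steps = 1, 2 blocks in x, 12 blocks in y.
Stage 3: sliding steps = 1, 1 block in x, 6 blocks in y.

The same steps apply to the larger rgb.bmp image, where each level again shows its own sliding steps and grid dimensions.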
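
To make the grid numbers concrete, the following sketch multiplies gx by gy to get the number of thread blocks per stage. It assumes that gx and gy are the two dimensions of the launch grid, which the log suggests but does not state outright. The helper name blocks_per_stage is hypothetical.

```python
# (sliding steps, gx, gy) per DWT stage, copied from the logs above.
STAGES = {
    "192.bmp": [(2, 3, 12), (1, 2, 12), (1, 1, 6)],
    "rgb.bmp": [(9, 6, 15), (5, 4, 13), (3, 4, 11)],
}


def blocks_per_stage(stages):
    """Number of thread blocks launched per stage, assuming a gx-by-gy grid."""
    return [gx * gy for _steps, gx, gy in stages]


for image, stages in STAGES.items():
    print(image, blocks_per_stage(stages))
```

Under that assumption the 192x192 image launches 36, 24 and 6 blocks across its three stages, and the 1024x1024 image launches 90, 52 and 44.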

Kernel execution

Each DWT stage is executed by the fdwt53Kernel CUDA kernel. The repeated message "fdwt53Kernel in launchFDWT53Kernel has finished" marks the completion of one kernel launch.

6.gaussian

makefile ->:

release: $(SRC)
	$(CC) $(KERNEL_DIM) $(SRC) -o $(EXE) -I$(INCLUDE) -L$(CUDA_LIB_DIR) -keep

clang: $(SRC)
	clang++ $(SRC) -o $(EXE) -I../util --cuda-gpu-arch=sm_80 \
		-L/usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread -DTIMING

res:

WG size of kernel 1 = 512, WG size of kernel 2= 4 X 4
Total Device found: 1
Device Name              - NVIDIA A100-SXM4-40GB 

Total Global Memory                      - 41485888 KB
Shared memory available per block        - 48 KB
Number of registers per thread block     - 65536
Warp size in threads                     - 32
Memory Pitch                             - 2147483647 bytes
Maximum threads per block                - 1024
Maximum Thread Dimension (block)         - 1024 1024 64
Maximum Thread Dimension (grid)          - 2147483647 65535 65535
Total constant memory                    - 65536 bytes
CUDA ver                                 - 8.0
Clock rate                               - 1410000 KHz
Texture Alignment                        - 512 bytes
Device Overlap                           - Allowed
Number of Multi processors               - 108

Read file from ../../data/gaussian/matrix4.txt 

Time total (including memory transfers) 0.161765 sec
Time for CUDA kernels:  0.000088 sec
WG size of kernel 1 = 512, WG size of kernel 2= 4 X 4
Total Device found: 1
Device Name              - NVIDIA A100-SXM4-40GB 

Total Global Memory                      - 41485888 KB
Shared memory available per block        - 48 KB
Number of registers per thread block     - 65536
Warp size in threads                     - 32
Memory Pitch                             - 2147483647 bytes
Maximum threads per block                - 1024
Maximum Thread Dimension (block)         - 1024 1024 64
Maximum Thread Dimension (grid)          - 2147483647 65535 65535
Total constant memory                    - 65536 bytes
CUDA ver                                 - 8.0
Clock rate                               - 1410000 KHz
Texture Alignment                        - 512 bytes
Device Overlap                           - Allowed
Number of Multi processors               - 108

Create matrix internally in parse, size = 16 

Time total (including memory transfers) 0.073124 sec
Time for CUDA kernels:  0.000326 sec

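Device information

Device name: NVIDIA A100-SXM4-40GB
Total global memory: 41485888 KB (about 39.6 GiB)
Shared memory available per block: 48 KB
Registers per thread block: 65536
Warp size: 32 threads
Maximum threads per block: 1024
Maximum thread dimensions (block): 1024x1024x64
Maximum thread dimensions (grid): 2147483647x65535x65535
Constant memory: 65536 bytes (64 KB)
CUDA ver: 8.0 (for an A100 this matches the compute capability, 8.0; the CUDA version reported by nvidia-smi above is 12.0)
Clock rate: 1410000 KHz (1.41 GHz)
Texture alignment: 512 bytes
Multiprocessors: 108 (the GPU has 108 SMs, i.e. streaming multiprocessors)

Workgroup size

Kernel 1 workgroup size: 512
Kernel 2 workgroup size: 4x4 (16 threads)

Execution time

Total time (including memory transfers)
- First run: 0.161765 s
- Second run: 0.073124 s
CUDA kernel time
- First run: 0.000088 s
- Second run: 0.000326 s

In both runs the kernel time is a tiny fraction of the total, so the computation itself is small and most of the time goes to data transfer and other non-compute work. The two runs use different inputs (matrix4.txt read from file versus a matrix of size 16 generated internally), which probably accounts for the difference in kernel time. The difference in total time may also come from caching effects.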
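
The share of time spent in kernels can be put into numbers. This is a minimal sketch using only the four timings printed above. The run labels and the helper name kernel_share are chosen here for illustration.

```python
# (total seconds incl. transfers, CUDA kernel seconds) copied from the two runs above.
RUNS = {
    "matrix4.txt": (0.161765, 0.000088),
    "internal, size 16": (0.073124, 0.000326),
}


def kernel_share(total_s, kernel_s):
    """Percentage of the total run time spent inside CUDA kernels."""
    return 100.0 * kernel_s / total_s


for name, (total_s, kernel_s) in RUNS.items():
    print(f"{name}: {kernel_share(total_s, kernel_s):.3f} % in kernels")
```

Both shares are well below 1 %, so for inputs this small the measurement mostly reflects setup and transfer cost rather than GPU compute throughput.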

7.heartwall

makefile ->

# compile main function file into object (binary)
main.o: main.cu kernel.cu define.c
	nvcc $(OUTPUT) $(KERNEL_DIM) main.cu -I./AVI -c -O3 -keep

res:

WG size of kernel = 256 
frame progress: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

8.hotspot

makefile ->

release: $(SRC)
	$(CC) $(KERNEL_DIM) $(SRC) -o $(EXE) -I$(INCLUDE) -L$(CUDA_LIB_DIR) -keep

enum: $(SRC)
	$(CC) $(KERNEL_DIM) -deviceemu $(SRC) -o $(EXE) -I$(INCLUDE) -L$(CUDA_LIB_DIR) -keep

debug: $(SRC)
	$(CC) $(KERNEL_DIM) -g $(SRC) -o $(EXE) -I$(INCLUDE) -L$(CUDA_LIB_DIR) -keep

debugenum: $(SRC)
	$(CC) $(KERNEL_DIM) -g -deviceemu $(SRC) -o $(EXE) -I$(INCLUDE) -L$(CUDA_LIB_DIR) -keep

res:

WG size of kernel = 16 X 16
pyramidHeight: 2
gridSize: [512, 512]
border:[2, 2]
blockGrid:[43, 43]
targetBlock:[12, 12]
Start computing the transient temperature
Ending simulation
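
The printed hotspot parameters are related by simple tiling arithmetic. The sketch below reproduces border, targetBlock and blockGrid from gridSize, the 16 x 16 workgroup size and pyramidHeight. The relations are inferred from the printed values and are an assumption about how the benchmark derives them. The function name hotspot_tiling is hypothetical.

```python
def hotspot_tiling(grid_size, block_size, pyramid_height):
    """Tiling arithmetic that reproduces the values printed by the hotspot run.

    Assumption: each side of a thread block loses `pyramid_height` cells to the
    border (halo), and the remaining cells form the target block.
    """
    border = pyramid_height
    target_block = block_size - 2 * border
    block_grid = -(-grid_size // target_block)  # ceiling division
    return border, target_block, block_grid


print(hotspot_tiling(grid_size=512, block_size=16, pyramid_height=2))
```

With these inputs the result is border 2, target block 12 and block grid 43 per dimension, which agrees with the log.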

9.hotspot3d

10.huffman

makefile ->

NVCC_OPTS=-O3 -arch=sm_80 -Xcompiler -m64 -g -G -keep

res:

CUDA initialized.
CUDA! Starting VLC Tests!
Parameters: num_elements: 262144, num_blocks: 1024, num_block_threads: 256

Time to generate:  0.3 ms

../../data/huffman/test1024_H2.206587175259.in, 1048576 bytes, entropy 2.206587

CPU Encoding time (CPU): 12.941000 (ms)
CPU Encoded to 291334 [B]
GPU Encoding time (SM64HUFF): 0.183190 (ms)
Num_blocks to be passed to scan is 1024.
Comparing vectors: 
PASS! vectors are matching!
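
A few derived figures help read the huffman output: the CPU-to-GPU encoding time ratio, the compression ratio, and how close the encoded size is to the entropy bound. The sketch uses only numbers from the log above. It assumes that one symbol corresponds to one input byte and that the two timers cover comparable work, neither of which the log confirms.

```python
# Figures copied from the huffman output above.
input_bytes = 1048576
encoded_bytes = 291334
entropy_bits_per_symbol = 2.206587
cpu_ms = 12.941000
gpu_ms = 0.183190

speedup = cpu_ms / gpu_ms              # CPU encoding time over GPU encoding time
ratio = encoded_bytes / input_bytes    # encoded size relative to the input
# Entropy lower bound in bytes, assuming one symbol per input byte.
bound_bytes = entropy_bits_per_symbol * input_bytes / 8

print(f"speedup:           {speedup:.1f}x")
print(f"compression ratio: {ratio:.4f}")
print(f"entropy bound:     {bound_bytes:.0f} bytes (encoded: {encoded_bytes})")
```

Under these assumptions the GPU encoder is about 70 times faster than the CPU encoder, and the encoded size is less than 1 % above the entropy bound.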

11.hybridsort

12.kmeans

gcc -g -fopenmp -O2  cluster.o getopt.o kmeans.o kmeans_clustering.o kmeans_cuda.o rmse.o -o kmeans -L/usr/local/cuda/lib64 -lcuda -lcudart -lm
gcc: error: kmeans.o: No such file or directory
Makefile:26: recipe for target 'kmeans' failed
make: *** [kmeans] Error 1

13.lavaMD

makefile ->

# OMP_FLAG = 	-Xcompiler paste_one_here
CUDA_FLAG = -arch sm_80 -keep

res:

thread block size of kernel = 128 
Configuration used: boxes1d = 10
Time spent in different stages of GPU_CUDA KERNEL:
 0.411684006453 s, 98.182235717773 % : GPU: SET DEVICE / DRIVER INIT
 0.000730999978 s,  0.174335688353 % : GPU MEM: ALO
 0.001612000051 s,  0.384444773197 % : GPU MEM: COPY IN
 0.003311000066 s,  0.789638161659 % : GPU: KERNEL
 0.001444999943 s,  0.344617068768 % : GPU MEM: COPY OUT
 0.000523000024 s,  0.124729916453 % : GPU MEM: FRE
Total time:
0.419306010008 s

14. leukocyte

./meschach_lib/meschach.a -L/usr/local/cuda/lib64 -lm -lcuda -lcudart
gcc: error: avilib.o: No such file or directory
gcc: error: find_ellipse.o: No such file or directory
gcc: error: track_ellipse.o: No such file or directory
Makefile:32: recipe for target 'leukocyte' failed
make[1]: *** [leukocyte] Error 1
make[1]: Leaving directory '/home/u200810220/cuda/rodinia/gpu-rodinia/cuda/leukocyte/CUDA'
Makefile:4: recipe for target 'CUDA/leukocyte' failed
make: *** [CUDA/leukocyte] Error 2

15.lud

cuda-make:

NVCC = nvcc -keep

DEFS += \
		-DGPU_TIMER \
		$(SPACE)

NVCCFLAGS += -I../common \
			 -O3 \
			 -use_fast_math \
			 -arch=sm_80 \
			 -lm \
			 $(SPACE)

res:

WG size of kernel = 16 X 16
Generate input matrix internally, size =256
Creating matrix internally size=256
Before LUD
Time consumed(ms): 0.877000
After LUD

16.mummergpu

suffix-tree.cpp:1764:26: error: ‘read’ was not declared in this scope
     while ((bytes_read = read(qfile, buf, sizeof(buf))) != 0)
                          ^~~~
suffix-tree.cpp:1764:26: note: suggested alternative: ‘fread’
     while ((bytes_read = read(qfile, buf, sizeof(buf))) != 0)
                          ^~~~
                          fread
suffix-tree.cpp:1807:34: error: ‘lseek’ was not declared in this scope
                     off_t seek = lseek(qfile, -(bytes_read - i), SEEK_CUR);
                                  ^~~~~
suffix-tree.cpp:1807:34: note: suggested alternative: ‘seek’
                     off_t seek = lseek(qfile, -(bytes_read - i), SEEK_CUR);
                                  ^~~~~
                                  seek
suffix-tree.cpp:1715:33: warning: unused parameter ‘rc’ [-Wunused-parameter]
                            bool rc)
                                 ^~
suffix-tree.cpp: In function ‘int addMatchToBuffer(int, int, int)’:
suffix-tree.cpp:2180:1: warning: no return statement in function returning non-void [-Wreturn-type]
 }
 ^
suffix-tree.cpp: In function ‘void printAlignments(ReferencePage*, Alignment*, char*, int, TextureAddress, int, int, int, bool, bool)’:
suffix-tree.cpp:2491:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   if (printParentId == matchNodeId)
       ~~~~~~~~~~~~~~^~~~~~~~~~~~~~
suffix-tree.cpp:2631:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
             if (cid == matchNodeId)
                 ~~~~^~~~~~~~~~~~~~
Makefile:114: recipe for target 'obj/release/suffix-tree.cpp_o' failed
make: *** [obj/release/suffix-tree.cpp_o] Error 1

The two errors mean that suffix-tree.cpp calls the POSIX functions read and lseek without a visible declaration. Both are declared in <unistd.h>, so adding #include <unistd.h> to that file is the usual fix (not verified in this run).

17. myocyte

makefile ->

# link objects(binaries) together
myocyte.out:		main.o
	nvcc	main.o \
				-I/usr/local/cuda/include \
				-L/usr/local/cuda/lib \
				-lm -lcuda -lcudart \
                -keep -o myocyte.out

res:

Time spent in different stages of the application:
0.000000000000 s, 0.000000000000 % : SETUP VARIABLES
1.817788958549 s, 84.736045837402 % : ALLOCATE CPU MEMORY AND GPU MEMORY
0.032156001776 s, 1.498948574066 % : READ DATA FROM FILES
0.295287013054 s, 13.764773368835 % : RUN COMPUTATION
0.000005000000 s, 0.000233074476 % : FREE MEMORY
Total time:
2.145236968994 s

17. nn

LOCAL_CC = gcc -g -O3 -Wall
CC := $(CUDA_DIR)/bin/nvcc -keep

res:

1988 12 27  0 18 TONY       30.0  89.8  113   39 --> Distance=0.199997
1980 10 22 18  3 ISAAC      30.1  90.4  110  778 --> Distance=0.412312
1997 11 14 12 24 HELENE     30.5  89.8  134  529 --> Distance=0.538515
2003  8 27 12 10 TONY       29.9  89.4  160  286 --> Distance=0.608275
1974 12 22 18 24 JOYCE      30.6  89.9   80  593 --> Distance=0.608276

18.nw

release: $(SRC)
	$(CC) ${KERNEL_DIM} $(SRC) -o $(EXE) -I$(INCLUDE) -L$(CUDA_LIB_DIR)  -keep

clang: $(SRC)
	clang++ $(SRC) -o $(EXE) -I../util --cuda-gpu-arch=sm_80 \
		-L/usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread -DTIMING -keep

enum: $(SRC)
	$(CC) ${KERNEL_DIM} -deviceemu $(SRC) -o $(EXE) -I$(INCLUDE) -L$(CUDA_LIB_DIR)  -keep

debug: $(SRC)
	$(CC) ${KERNEL_DIM} -g $(SRC) -o $(EXE) -I$(INCLUDE) -L$(CUDA_LIB_DIR) -keep

debugenum: $(SRC)
	$(CC) ${KERNEL_DIM} -g -deviceemu $(SRC) -o $(EXE) -I$(INCLUDE) -L$(CUDA_LIB_DIR) -keep

clean: $(SRC)
	rm -f $(EXE) $(EXE).linkinfo result.txt

19.particlefilter

#makefile

include ../../common/make.config

CC := $(CUDA_DIR)/bin/nvcc -keep

INCLUDE := $(CUDA_DIR)/include

all: naive float

naive: ex_particle_CUDA_naive_seq.cu
	$(CC) -I$(INCLUDE) -L$(CUDA_LIB_DIR) -lcuda -g -lm -O3 -use_fast_math -arch sm_80 ex_particle_CUDA_naive_seq.cu -o particlefilter_naive
	
float: ex_particle_CUDA_float_seq.cu
	$(CC) -I$(INCLUDE) -L$(CUDA_LIB_DIR) -lcuda -g -lm -O3 -use_fast_math -arch sm_80 ex_particle_CUDA_float_seq.cu -o particlefilter_float

clean:
	rm particlefilter_naive particlefilter_float

res:


VIDEO SEQUENCE TOOK 0.034583
TIME TO GET NEIGHBORS TOOK: 0.000005
TIME TO GET WEIGHTSTOOK: 0.000005
TIME TO SET ERROR TOOK: 0.000149
TIME TO GET LIKELIHOODS TOOK: 0.000205
TIME TO GET EXP TOOK: 0.000021
TIME TO SUM WEIGHTS TOOK: 0.000002
TIME TO NORMALIZE WEIGHTS TOOK: 0.000001
TIME TO MOVE OBJECT TOOK: 0.000005
XE: 64.432825
YE: 64.430852
0.610713
TIME TO CALC CUM SUM TOOK: 0.000008
TIME TO CALC U TOOK: 0.000005
SENDING TO GPU TOOK: 0.000069
CUDA EXEC TOOK: 0.000125
SENDING BACK FROM GPU TOOK: 0.000033
TIME TO CALC NEW ARRAY X AND Y TOOK: 0.000233
TIME TO RESET WEIGHTS TOOK: 0.000003
TIME TO SET ERROR TOOK: 0.000138
TIME TO GET LIKELIHOODS TOOK: 0.000197
TIME TO GET EXP TOOK: 0.000015
TIME TO SUM WEIGHTS TOOK: 0.000002
TIME TO NORMALIZE WEIGHTS TOOK: 0.000002
TIME TO MOVE OBJECT TOOK: 0.000001
XE: 62.365991
YE: 65.436600
2.175731
TIME TO CALC CUM SUM TOOK: 0.000005
TIME TO CALC U TOOK: 0.000002
SENDING TO GPU TOOK: 0.000056
CUDA EXEC TOOK: 0.000120
SENDING BACK FROM GPU TOOK: 0.000024
TIME TO CALC NEW ARRAY X AND Y TOOK: 0.000205
TIME TO RESET WEIGHTS TOOK: 0.000002
TIME TO SET ERROR TOOK: 0.000140
TIME TO GET LIKELIHOODS TOOK: 0.000185
TIME TO GET EXP TOOK: 0.000014
TIME TO SUM WEIGHTS TOOK: 0.000002
TIME TO NORMALIZE WEIGHTS TOOK: 0.000001
TIME TO MOVE OBJECT TOOK: 0.000002
XE: 60.497261
YE: 66.539564
4.326496
TIME TO CALC CUM SUM TOOK: 0.000005
TIME TO CALC U TOOK: 0.000001
SENDING TO GPU TOOK: 0.000056
CUDA EXEC TOOK: 0.000118
SENDING BACK FROM GPU TOOK: 0.000023
TIME TO CALC NEW ARRAY X AND Y TOOK: 0.000203
TIME TO RESET WEIGHTS TOOK: 0.000002
TIME TO SET ERROR TOOK: 0.000144
TIME TO GET LIKELIHOODS TOOK: 0.000184
TIME TO GET EXP TOOK: 0.000015
TIME TO SUM WEIGHTS TOOK: 0.000003
TIME TO NORMALIZE WEIGHTS TOOK: 0.000001
TIME TO MOVE OBJECT TOOK: 0.000005
XE: 58.636936
YE: 67.376260
6.337317
TIME TO CALC CUM SUM TOOK: 0.000007
TIME TO CALC U TOOK: 0.000002
SENDING TO GPU TOOK: 0.000055
CUDA EXEC TOOK: 0.000034
SENDING BACK FROM GPU TOOK: 0.000024
TIME TO CALC NEW ARRAY X AND Y TOOK: 0.000120
TIME TO RESET WEIGHTS TOOK: 0.000002
TIME TO SET ERROR TOOK: 0.000135
TIME TO GET LIKELIHOODS TOOK: 0.000184
TIME TO GET EXP TOOK: 0.000014
TIME TO SUM WEIGHTS TOOK: 0.000002
TIME TO NORMALIZE WEIGHTS TOOK: 0.000002
TIME TO MOVE OBJECT TOOK: 0.000001
XE: 56.268162
YE: 68.101485
8.752343
TIME TO CALC CUM SUM TOOK: 0.000004
TIME TO CALC U TOOK: 0.000004
SENDING TO GPU TOOK: 0.000054
CUDA EXEC TOOK: 0.000128
SENDING BACK FROM GPU TOOK: 0.000022
TIME TO CALC NEW ARRAY X AND Y TOOK: 0.000210
TIME TO RESET WEIGHTS TOOK: 0.000002
TIME TO SET ERROR TOOK: 0.000142
TIME TO GET LIKELIHOODS TOOK: 0.000185
TIME TO GET EXP TOOK: 0.000015
TIME TO SUM WEIGHTS TOOK: 0.000004
TIME TO NORMALIZE WEIGHTS TOOK: 0.000001
TIME TO MOVE OBJECT TOOK: 0.000002
XE: 54.499444
YE: 69.650363
11.053830
TIME TO CALC CUM SUM TOOK: 0.000008
TIME TO CALC U TOOK: 0.000002
SENDING TO GPU TOOK: 0.000055
CUDA EXEC TOOK: 0.000141
SENDING BACK FROM GPU TOOK: 0.000023
TIME TO CALC NEW ARRAY X AND Y TOOK: 0.000226
TIME TO RESET WEIGHTS TOOK: 0.000002
TIME TO SET ERROR TOOK: 0.000139
TIME TO GET LIKELIHOODS TOOK: 0.000184
TIME TO GET EXP TOOK: 0.000015
TIME TO SUM WEIGHTS TOOK: 0.000003
TIME TO NORMALIZE WEIGHTS TOOK: 0.000001
TIME TO MOVE OBJECT TOOK: 0.000002
XE: 52.481617
YE: 70.550595
13.250791
TIME TO CALC CUM SUM TOOK: 0.000008
TIME TO CALC U TOOK: 0.000003
SENDING TO GPU TOOK: 0.000055
CUDA EXEC TOOK: 0.000139
SENDING BACK FROM GPU TOOK: 0.000022
TIME TO CALC NEW ARRAY X AND Y TOOK: 0.000223
TIME TO RESET WEIGHTS TOOK: 0.000002
TIME TO SET ERROR TOOK: 0.000139
TIME TO GET LIKELIHOODS TOOK: 0.000184
TIME TO GET EXP TOOK: 0.000015
TIME TO SUM WEIGHTS TOOK: 0.000003
TIME TO NORMALIZE WEIGHTS TOOK: 0.000002
TIME TO MOVE OBJECT TOOK: 0.000002
XE: 50.406399
YE: 71.369707
15.462813
TIME TO CALC CUM SUM TOOK: 0.000011
TIME TO CALC U TOOK: 0.000003
SENDING TO GPU TOOK: 0.000055
CUDA EXEC TOOK: 0.000088
SENDING BACK FROM GPU TOOK: 0.000023
TIME TO CALC NEW ARRAY X AND Y TOOK: 0.000175
TIME TO RESET WEIGHTS TOOK: 0.000002
TIME TO SET ERROR TOOK: 0.000139
TIME TO GET LIKELIHOODS TOOK: 0.000189
TIME TO GET EXP TOOK: 0.000016
TIME TO SUM WEIGHTS TOOK: 0.000003
TIME TO NORMALIZE WEIGHTS TOOK: 0.000002
TIME TO MOVE OBJECT TOOK: 0.000002
XE: 48.546236
YE: 72.165669
17.478471
TIME TO CALC CUM SUM TOOK: 0.000008
TIME TO CALC U TOOK: 0.000005
SENDING TO GPU TOOK: 0.000054
CUDA EXEC TOOK: 0.000142
SENDING BACK FROM GPU TOOK: 0.000023
TIME TO CALC NEW ARRAY X AND Y TOOK: 0.000225
TIME TO RESET WEIGHTS TOOK: 0.000002
PARTICLE FILTER TOOK 0.291389
ENTIRE PROGRAM TOOK 0.325972
VIDEO SEQUENCE TOOK 0.014288
TIME TO SEND TO GPU: 0.000087
GPU Execution: 0.004080
FREE TIME: 0.000026
TIME TO SEND BACK: 0.000080
SEND ARRAY X BACK: 0.000022
SEND ARRAY Y BACK: 0.000017
SEND WEIGHTS BACK: 0.000015
XE: 48.546236
YE: 72.165669
17.478471
PARTICLE FILTER TOOK 0.255126
ENTIRE PROGRAM TOOK 0.269414

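Video sequence: VIDEO SEQUENCE TOOK 0.034583 in the first run and 0.014288 in the second, so generating the video sequence is cheap in both.
Memory allocation: the ALLOCATE CPU MEMORY AND GPU MEMORY stage belongs to the myocyte output in section 17, where it takes 84.74 % of the total time and is the main bottleneck. The particlefilter logs print no such stage.
Compute:
- RUN COMPUTATION (0.295287013054 s) is likewise a myocyte figure. For particlefilter the corresponding totals are PARTICLE FILTER TOOK 0.291389 (first run) and 0.255126 (second run).
- Each CUDA EXEC TOOK entry in the first run lies between 0.000034 and 0.000142, so the individual kernel launches are very short.
Position estimate: XE and YE are the estimated position after filtering. Over the iterations XE falls from 64.432825 to 48.546236 while YE rises from 64.430852 to 72.165669.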
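
To see how little of the particle filter time is spent in GPU kernels, the CUDA EXEC TOOK entries of the first run can be summed and compared with PARTICLE FILTER TOOK. The sketch below copies the nine values from the log above. It assumes they are in the same unit as the total (presumably seconds) and that they cover all kernel launches of that run.

```python
# The nine "CUDA EXEC TOOK" values of the first run, in log order.
CUDA_EXEC_S = [0.000125, 0.000120, 0.000118, 0.000034, 0.000128,
               0.000141, 0.000139, 0.000088, 0.000142]
# "PARTICLE FILTER TOOK" of the same run.
PARTICLE_FILTER_S = 0.291389

total_exec = sum(CUDA_EXEC_S)
share = 100.0 * total_exec / PARTICLE_FILTER_S

print(f"summed CUDA exec time: {total_exec:.6f}")
print(f"share of filter time:  {share:.2f} %")
```

The summed kernel time is about 0.001, well under 1 % of the reported particle filter time, so the bulk of that total is spent elsewhere.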

20.pathfinder

CC := $(CUDA_DIR)/bin/nvcc -keep
INCLUDE := $(CUDA_DIR)/include

SRC = pathfinder.cu

EXE = pathfinder

release:
	$(CC) $(SRC) -o $(EXE) -I$(INCLUDE) -L$(CUDA_LIB_DIR) 

clang: $(SRC)
	clang++ $(SRC) -o $(EXE) -I../util --cuda-gpu-arch=sm_80 \
		-L/usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread -DTIMING

pyramidHeight: 20
gridSize: [100000]
border:[20]
blockSize: 256
blockGrid:[463]
targetBlock:[216]
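
The pathfinder parameters imply that part of every thread block is spent on border work. The sketch below first rebuilds targetBlock and blockGrid from the printed values, then estimates the fraction of launched threads that map to a distinct output column. It assumes that each block launches blockSize threads and loses a border of pyramidHeight columns on each side, which is inferred from the numbers rather than taken from the source. The name useful_fraction is hypothetical.

```python
def useful_fraction(grid_size, block_size, block_grid):
    """Fraction of launched threads that map to a distinct output column."""
    launched = block_size * block_grid
    return grid_size / launched


# Values printed by the run: gridSize 100000, blockSize 256, border 20.
target_block = 256 - 2 * 20                # 216, matches targetBlock in the log
block_grid = -(-100000 // target_block)    # ceiling division, matches blockGrid
fraction = useful_fraction(100000, 256, block_grid)

print(target_block, block_grid, round(fraction, 3))
```

Under these assumptions about 84 % of the launched threads produce a final column, and the rest recompute border cells.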

21. srad

srad1:

CC := $(CUDA_DIR)/bin/nvcc -keep


Time spent in different stages of the application:
 0.000000000000 s,  0.000000000000 % : SETUP VARIABLES
 0.000005000000 s,  0.000981340767 % : READ COMMAND LINE PARAMETERS
 0.052648998797 s, 10.333321571350 % : READ IMAGE FROM FILE
 0.000437999988 s,  0.085965454578 % : RESIZE IMAGE
 0.431991994381 s, 84.786270141602 % : GPU DRIVER INIT, CPU/GPU SETUP, MEMORY ALLOCATION
 0.000071000002 s,  0.013935038820 % : COPY DATA TO CPU->GPU
 0.000022000000 s,  0.004317899700 % : EXTRACT IMAGE
 0.007325000130 s,  1.437664270401 % : COMPUTE
 0.000003000000 s,  0.000588804483 % : COMPRESS IMAGE
 0.000486000004 s,  0.095386326313 % : COPY DATA TO GPU->CPU
 0.015652999282 s,  3.072185516357 % : SAVE IMAGE INTO FILE
 0.000862999994 s,  0.169379413128 % : FREE MEMORY
Total time:
0.509507000446 s

srad2:

(base) u200810220@n1:~/cuda/rodinia/gpu-rodinia/cuda/srad/srad_v2$ bash run
WG size of kernel = 16 X 16
Randomizing the input matrix
Start the SRAD main loop
Computation Done

22.streamcluster

NVCC = $(CUDA_DIR)/bin/nvcc -keep
# make dp=1 compiles the CUDA kernels with double-precision support
ifeq ($(dp),1)
	NVCC_FLAGS += --gpu-name sm_80



PARSEC Benchmark Suite
read 65536 points
finish local search
time = 5.873713s
time pgain = 0.000000s
time pgain_dist = 0.000000s
time pgain_init = 0.000000s
time pselect = 0.000300s
time pspeedy = 0.390567s
time pshuffle = 0.004491s
time localSearch = 5.743525s

====CUDA Timing info (pgain)====
time serial = 1.350207s
time CPU to GPU memory copy = 0.618665s
time GPU to CPU memory copy back = 0.695247s
time GPU malloc = 0.333729s
time GPU free = 1.709554s
time kernel = 0.121225s
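
Ranking the pgain timers shows where the CUDA version of streamcluster spends its time. The sketch uses only the six values printed above. Whether the timers are disjoint is not stated in the log, so their sum is only a rough indication.

```python
# "CUDA Timing info (pgain)" values copied from the log above, in seconds.
PGAIN_S = {
    "serial": 1.350207,
    "CPU to GPU memory copy": 0.618665,
    "GPU to CPU memory copy back": 0.695247,
    "GPU malloc": 0.333729,
    "GPU free": 1.709554,
    "kernel": 0.121225,
}

total = sum(PGAIN_S.values())
ranked = sorted(PGAIN_S.items(), key=lambda item: item[1], reverse=True)

for name, seconds in ranked:
    print(f"{100.0 * seconds / total:5.1f} %  {name}")
```

GPU free and the serial part are the largest entries, while the kernel is the smallest at roughly 2.5 % of the summed timers.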