NVIDIA Jetson AGX Orin 32GB - 200T Performance Test

Article link: https://blog.csdn.net/Reasonss/article/details/148062586

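Conclusions

Hardware verification: the measured specifications match NVIDIA's official technical specifications.
AI performance: YOLOv8s built as a TensorRT INT8 engine reaches 373 frames/s (trtexec throughput), 214% higher than FP32.
Memory performance: device-to-device copy bandwidth reaches 114.5 GB/s.
Storage performance: sequential write speed is 187 MB/s, enough for high-rate data logging.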

Test Report

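Category	Test item	Key metric	Measured value	Status
Hardware specifications	Compute architecture version	CUDA compute capability	8.7	Pass ✅
	Memory capacity	Total memory	32 GB	Pass ✅
	GPU core count	CUDA cores	1792 cores	Pass ✅
AI inference performance	YOLOv8s throughput (INT8)	Frames per second	373 frames/s	Best
	YOLOv8s throughput (FP16)	Frames per second	234 frames/s	High
	YOLOv8s throughput (FP32)	Frames per second	119 frames/s	Baseline
Compute performance	CPU single-core performance	Events per second	2615 events/s	Normal
	CPU multi-core performance	Geekbench 5 score	See https://browser.geekbench.com/v5/cpu/23371424	See linked result
Data transfer	Host↔device peak bandwidth	Transfer rate	26.2 GB/s (device to host; 18.1 GB/s host to device)	High
	On-chip bandwidth	Device-to-device copy rate	114.5 GB/s	Very high
Storage performance	Large-file write speed	Sequential write rate	187 MB/s	Good
	Random read/write	Mixed read/write rate	18.93/12.62 MiB/s	Stable
	Response latency	95th-percentile request latency	0.35 ms	Low latency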

Test Methods

Core specification check

cd /usr/local/cuda/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery


CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Orin"
  CUDA Driver Version / Runtime Version          11.4 / 11.4
  CUDA Capability Major/Minor version number:    8.7
  Total amount of global memory:                 30593 MBytes (32079273984 bytes)
  (014) Multiprocessors, (128) CUDA Cores/MP:    1792 CUDA Cores
  GPU Max Clock rate:                            930 MHz (0.93 GHz)
  Memory Clock rate:                             930 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.4, CUDA Runtime Version = 11.4, NumDevs = 1
Result = PASS
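
The core count and memory size in the report can be cross-checked against the raw deviceQuery lines. The following is a minimal sketch, assuming the two relevant lines have been copied into a string; `parse_device_query` is a hypothetical helper written for this post, not part of the CUDA samples.

```python
import re

# Two lines copied from the deviceQuery output above.
DEVICE_QUERY = """\
  Total amount of global memory:                 30593 MBytes (32079273984 bytes)
  (014) Multiprocessors, (128) CUDA Cores/MP:    1792 CUDA Cores
"""

def parse_device_query(text):
    """Return (SM count, cores per SM, reported core total, memory in bytes)."""
    cores_match = re.search(
        r"\((\d+)\) Multiprocessors, \((\d+)\) CUDA Cores/MP:\s+(\d+) CUDA Cores",
        text,
    )
    sm, per_sm, total = (int(g) for g in cores_match.groups())
    mem_bytes = int(re.search(r"\((\d+) bytes\)", text).group(1))
    return sm, per_sm, total, mem_bytes

sm, per_sm, total, mem_bytes = parse_device_query(DEVICE_QUERY)
cores = sm * per_sm          # 14 SMs x 128 cores per SM
mib = mem_bytes // 2**20     # whole MiB, as deviceQuery prints it
gib = mem_bytes / 2**30      # binary gigabytes visible to CUDA

print(f"cores: {cores} (reported {total})")
print(f"memory: {mib} MiB = {gib:.2f} GiB")
```

The product of SM count and cores per SM reproduces the 1792 figure, and the byte count shows that the nominal 32 GB module exposes slightly under 30 GiB to CUDA.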

AI inference performance test

# Install dependencies
pip install onnx==1.14.0
sudo apt-get install python3-testresources 
pip install --force-reinstall python-dateutil==2.8.2  
pip install ultralytics
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc 

# Export YOLOv8s to ONNX
yolo mode=export model=yolov8s.pt format=onnx

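# Build a TensorRT engine and benchmark it (INT8 mode)
# Without a calibration cache (--calib), trtexec uses placeholder dynamic ranges, so this run measures speed only, not accuracy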
/usr/src/tensorrt/bin/trtexec \
  --onnx=yolov8s.onnx \
  --saveEngine=yolov8s.engine \
  --int8 \
  --useCudaGraph


# Benchmark FP32 precision (drop the --int8 flag)
/usr/src/tensorrt/bin/trtexec --onnx=yolov8s.onnx --saveEngine=yolov8s_fp32.engine

# Benchmark FP16 precision (add --fp16)
/usr/src/tensorrt/bin/trtexec --onnx=yolov8s.onnx --saveEngine=yolov8s_fp16.engine --fp16

# Once an engine has been built, load it directly so later runs skip the build step (build options such as --onnx and --fp16 are not needed)
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov8s_fp16.engine


# Check the result (look for these two lines)
# [I] === Performance summary ===
# [I] Throughput:  X qps


int8	373.043 qps
fp16	234.349 qps
fp32	118.852 qps
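
The 214% figure in the conclusions follows directly from these three numbers. A small sketch of the arithmetic, assuming one query corresponds to one frame (the default batch size of 1 for the exported ONNX model); `speedup_table` is a name chosen for illustration.

```python
# Throughput values reported above, in queries per second.
qps = {"int8": 373.043, "fp16": 234.349, "fp32": 118.852}

def speedup_table(results, baseline="fp32"):
    """Map each precision to (ratio to baseline, percent increase over baseline)."""
    base = results[baseline]
    return {name: (value / base, (value / base - 1) * 100)
            for name, value in results.items()}

table = speedup_table(qps)
for name, (ratio, pct) in table.items():
    print(f"{name}: {ratio:.2f}x of FP32, {pct:+.0f}%")
```

INT8 comes out at about 3.14 times the FP32 throughput, which is a 214% increase, and FP16 at about 1.97 times.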

Single-core performance test

sudo apt-get install -y sysbench
sysbench cpu --threads=1 run


CPU speed:
events per second: 2615.41

General statistics:
total time: 10.0001s
total number of events: 26158
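
As a sanity check, the events-per-second figure should be close to the total event count divided by the total run time. A quick sketch using the values printed above; the small remaining difference presumably comes from how sysbench times the run internally, which is an assumption rather than something this output shows.

```python
total_events = 26158     # "total number of events"
total_time_s = 10.0001   # "total time"
reported_eps = 2615.41   # "events per second"

def events_per_second(events, seconds):
    """Average event rate over the whole run."""
    return events / seconds

estimate = events_per_second(total_events, total_time_s)
rel_diff = abs(estimate - reported_eps) / reported_eps

print(f"{estimate:.2f} events/s, {rel_diff:.3%} away from the reported value")
```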

Multi-core performance test

# Download Geekbench
wget https://cdn.geekbench.com/Geekbench-5.4.1-LinuxARMPreview.tar.gz
tar xvf Geekbench-5.4.1-LinuxARMPreview.tar.gz
cd Geekbench-5.4.1-LinuxARMPreview

# Run the benchmark (the result is uploaded to the Geekbench Browser automatically)
./geekbench5


# Result

https://browser.geekbench.com/v5/cpu/23371424

Memory bandwidth test

t@t-desktop:/usr/local/cuda/samples/1_Utilities/bandwidthTest$ ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Orin
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     18.1

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     26.2

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     114.5

Result = PASS
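
To put these numbers in context, they can be compared with the module's theoretical peak memory bandwidth. The sketch below assumes the 256-bit bus reported by deviceQuery and an LPDDR5 data rate of 6400 MT/s, which yields the 204.8 GB/s that NVIDIA publishes for this module; the data rate comes from NVIDIA's specifications and was not measured in this post.

```python
BUS_WIDTH_BITS = 256    # from the deviceQuery output
DATA_RATE_MTS = 6400    # assumption: LPDDR5 data rate from NVIDIA's published specs

# Peak bandwidth in GB/s: bytes per transfer x mega-transfers per second / 1000.
peak_gbps = BUS_WIDTH_BITS // 8 * DATA_RATE_MTS / 1000

# Measured values from the bandwidthTest output above, in GB/s.
measured = {"host_to_device": 18.1, "device_to_host": 26.2, "device_to_device": 114.5}

def fraction_of_peak(gbps, peak=peak_gbps):
    """Measured bandwidth as a fraction of the theoretical peak."""
    return gbps / peak

for name, gbps in measured.items():
    print(f"{name}: {gbps} GB/s = {fraction_of_peak(gbps):.1%} of {peak_gbps} GB/s")
```

Under these assumptions the device-to-device copy reaches roughly 56% of the theoretical peak, while the host↔device paths sit far lower.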

Disk performance test

# Sequential write test (creates a 1 GiB test file)
dd if=/dev/zero of=testfile bs=1G count=1 oflag=direct

1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 5.73304 s, 187 MB/s
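
Note that dd reports decimal megabytes per second, whereas sysbench below reports MiB/s. The conversion can be reproduced from the byte count and elapsed time dd printed; `dd_rates` is an illustrative helper.

```python
def dd_rates(num_bytes, seconds):
    """Return the transfer rate as (MB/s, MiB/s)."""
    bytes_per_s = num_bytes / seconds
    return bytes_per_s / 1e6, bytes_per_s / 2**20

# Values from the dd summary line above.
mb_s, mib_s = dd_rates(1073741824, 5.73304)

print(f"{mb_s:.1f} MB/s = {mib_s:.1f} MiB/s")
```

The same write therefore reads as about 187 MB/s or about 179 MiB/s, which matters when comparing it with the sysbench figures.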


# Random read/write test (requires sysbench)
sudo apt-get install -y sysbench

sysbench fileio --file-total-size=2G prepare

2147483648 bytes written in 17.01 seconds (120.38 MiB/sec).

sudo sysbench fileio --file-test-mode=rndrw run
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Extra file open flags: (none)
128 files, 16MiB each
2GiB total file size
Block size 16KiB
Number of IO requests: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Initializing worker threads...

Threads started!


File operations:
    reads/s:                      1211.30
    writes/s:                     807.53
    fsyncs/s:                     2592.01

Throughput:
    read, MiB/s:                  18.93
    written, MiB/s:               12.62

General statistics:
    total time:                          10.0047s
    total number of events:              46007

Latency (ms):
         min:                                    0.00
         avg:                                    0.22
         max:                                   76.12
         95th percentile:                        0.35
         sum:                                 9979.21

Threads fairness:
    events (avg/stddev):           46007.0000/0.00
    execution time (avg/stddev):   9.9792/0.00
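
The throughput lines in this output are simply the operation rates multiplied by the 16 KiB block size, and the read-to-write ratio matches the configured 1.50. A short sketch of that check, using only values printed above; `mib_per_s` is a name chosen for illustration.

```python
BLOCK_SIZE_KIB = 16      # "Block size 16KiB"
reads_per_s = 1211.30    # "reads/s"
writes_per_s = 807.53    # "writes/s"

def mib_per_s(ops_per_s, block_kib=BLOCK_SIZE_KIB):
    """Throughput in MiB/s for a given operation rate and block size."""
    return ops_per_s * block_kib / 1024

read_mib = mib_per_s(reads_per_s)
write_mib = mib_per_s(writes_per_s)
rw_ratio = reads_per_s / writes_per_s

print(f"read {read_mib:.2f} MiB/s, write {write_mib:.2f} MiB/s, ratio {rw_ratio:.2f}")
```

Both results agree with the reported 18.93 and 12.62 MiB/s, so the low random I/O throughput reflects the small block size and the frequent fsync calls rather than a reporting quirk.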