NVIDIA Jetson AGX Orin 32GB (200 TOPS) Performance Test

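Conclusions

  1. Hardware verification: the values reported by deviceQuery (compute capability 8.7, 1792 CUDA cores, 32 GB memory) match NVIDIA's official specifications.
  2. AI inference: the INT8 YOLOv8s engine reached 373 inferences per second (qps as reported by trtexec), about 3.1x the FP32 throughput, a 214% increase.
  3. Memory performance: device-to-device copy bandwidth reached 114.5 GB/s.
  4. Storage performance: sequential write speed reached 187 MB/s (1 GB file, direct I/O), enough for high-rate data logging.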

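Test Report

| Category | Test item | Key metric | Measured value | Status |
|---|---|---|---|---|
| Hardware specifications | Compute architecture version | CUDA compute capability | 8.7 | Pass ✅ |
| | Memory capacity | Total memory | 32 GB (30593 MB reported by deviceQuery) | Pass ✅ |
| | GPU core count | CUDA cores | 1792 | Pass ✅ |
| AI inference performance | YOLOv8s throughput (INT8) | Inferences per second | 373 | Best |
| | YOLOv8s throughput (FP16) | Inferences per second | 234 | High |
| | YOLOv8s throughput (FP32) | Inferences per second | 118 | Baseline |
| Compute performance | CPU single-core | sysbench events per second | 2615 | Normal |
| | CPU multi-core | Geekbench 5 overall score | See the Geekbench result link in the multi-core section | Industry-leading |
| Data transfer | Host↔device peak bandwidth | Copy bandwidth | 26.2 GB/s (device to host; host to device was 18.1 GB/s) | High |
| | On-chip bandwidth | Device-to-device copy bandwidth | 114.5 GB/s | Very high |
| Storage performance | Large-file write | Sequential write speed | 187 MB/s | Excellent |
| | Random read/write | Mixed read/write throughput | 18.9 / 12.6 MiB/s | Stable |
| | I/O response latency | 95th percentile latency | 0.35 ms | Low latency |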

Test Methods

Verifying the core specifications

cd /usr/local/cuda/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery


CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Orin"
  CUDA Driver Version / Runtime Version          11.4 / 11.4
  CUDA Capability Major/Minor version number:    8.7
  Total amount of global memory:                 30593 MBytes (32079273984 bytes)
  (014) Multiprocessors, (128) CUDA Cores/MP:    1792 CUDA Cores
  GPU Max Clock rate:                            930 MHz (0.93 GHz)
  Memory Clock rate:                             930 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.4, CUDA Runtime Version = 11.4, NumDevs = 1
Result = PASS
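
The core count and memory figures in the report table can be cross-checked against this output with a little arithmetic. The following is a minimal sketch, with the three constants copied by hand from the deviceQuery lines above; the function names are illustrative and not part of any NVIDIA tool.

```python
# Values copied from the deviceQuery output above.
SM_COUNT = 14                     # "(014) Multiprocessors"
CORES_PER_SM = 128                # "(128) CUDA Cores/MP"
GLOBAL_MEM_BYTES = 32079273984    # "Total amount of global memory"


def cuda_cores(sm_count: int, cores_per_sm: int) -> int:
    """Total CUDA cores = multiprocessors x cores per multiprocessor."""
    return sm_count * cores_per_sm


def bytes_to_mib(n_bytes: int) -> int:
    """Whole mebibytes (2**20 bytes), the unit deviceQuery labels 'MBytes'."""
    return n_bytes // 2**20


def bytes_to_gib(n_bytes: int) -> float:
    """Gibibytes (2**30 bytes)."""
    return n_bytes / 2**30


print(cuda_cores(SM_COUNT, CORES_PER_SM))        # 1792
print(bytes_to_mib(GLOBAL_MEM_BYTES))            # 30593
print(round(bytes_to_gib(GLOBAL_MEM_BYTES), 2))  # 29.88
```

The usable figure is a little under 32 GB. A likely reason is that part of the memory is reserved by the system, since the GPU shares memory with the host ("Integrated GPU sharing Host Memory: Yes").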


AI inference performance test

# Install dependencies
pip install onnx==1.14.0
sudo apt-get install python3-testresources
pip install --force-reinstall python-dateutil==2.8.2
pip install ultralytics
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

# Export YOLOv8s to ONNX
yolo mode=export model=yolov8s.pt format=onnx

# Build the TensorRT engine and benchmark it (INT8 mode)
/usr/src/tensorrt/bin/trtexec \
  --onnx=yolov8s.onnx \
  --saveEngine=yolov8s.engine \
  --int8 \
  --useCudaGraph


# Benchmark FP32 precision (drop the --int8 flag; note that this run does not pass --useCudaGraph)
/usr/src/tensorrt/bin/trtexec --onnx=yolov8s.onnx --saveEngine=yolov8s_fp32.engine

# Benchmark FP16 precision (add --fp16; this run does not pass --useCudaGraph either)
/usr/src/tensorrt/bin/trtexec --onnx=yolov8s.onnx --saveEngine=yolov8s_fp16.engine --fp16

# Once an engine has been saved, load it directly to skip the slow build step.
# --onnx and --fp16 are build-time options and are not needed when loading a serialized engine.
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov8s_fp16.engine


# Read the results (look for these two lines)
# [I] === Performance summary ===
# [I] Throughput:  X qps
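
If the trtexec output is saved to a file, the throughput figure can be pulled out programmatically instead of by eye. The sketch below assumes only the `Throughput: X qps` line format shown above. The function name and the sample log are illustrative: the sample is reconstructed from that format plus the INT8 figure reported in this post, and is not a captured log.

```python
import re
from typing import Optional

# Matches e.g. "[I] Throughput: 373.043 qps" anywhere in the log text.
_THROUGHPUT_RE = re.compile(r"Throughput:\s*([0-9]+(?:\.[0-9]+)?)\s*qps")


def parse_throughput(log_text: str) -> Optional[float]:
    """Return the last 'Throughput: X qps' value in the log, or None if absent."""
    matches = _THROUGHPUT_RE.findall(log_text)
    return float(matches[-1]) if matches else None


# Illustrative sample, not a captured log.
sample_log = """\
[I] === Performance summary ===
[I] Throughput: 373.043 qps
"""

print(parse_throughput(sample_log))
```

Taking the last match is a deliberate choice: if several runs are appended to the same file, the most recent summary wins.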


int8	373.043 qps
fp16	234.349 qps
fp32	118.852 qps
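
The "214%" figure in the conclusions follows directly from these three numbers. As a worked check, the sketch below computes each precision's speedup over the FP32 baseline; the dictionary and function names are illustrative.

```python
# Throughput figures (qps) reported by trtexec above.
results_qps = {"int8": 373.043, "fp16": 234.349, "fp32": 118.852}


def speedups(results: dict, baseline: str = "fp32") -> dict:
    """Map each precision to (ratio vs. baseline, percent increase vs. baseline)."""
    base = results[baseline]
    return {
        name: (qps / base, (qps / base - 1.0) * 100.0)
        for name, qps in results.items()
    }


for name, (ratio, pct) in speedups(results_qps).items():
    print(f"{name}: {ratio:.2f}x baseline, {pct:+.0f}%")
```

INT8 comes out at roughly 3.14x FP32 (a 214% increase) and FP16 at roughly 1.97x (a 97% increase). Only the INT8 command passed `--useCudaGraph`, so part of the INT8 advantage may come from that flag rather than from quantization alone.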

Single-core CPU test

sudo apt-get install -y sysbench
sysbench cpu --threads=1 run


CPU speed:
events per second: 2615.41

General statistics:
total time: 10.0001s
total number of events: 26158

Multi-core CPU test

# Install Geekbench
wget https://cdn.geekbench.com/Geekbench-5.4.1-LinuxARMPreview.tar.gz
tar xvf Geekbench-5.4.1-LinuxARMPreview.tar.gz
cd Geekbench-5.4.1-LinuxARMPreview

# Run the benchmark (results are uploaded to the Geekbench website automatically)
./geekbench5


# Result

https://browser.geekbench.com/v5/cpu/23371424

Memory bandwidth test

t@t-desktop:/usr/local/cuda/samples/1_Utilities/bandwidthTest$ ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Orin
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     18.1

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     26.2

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     114.5

Result = PASS
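
To put these numbers in context, they can be expressed as a fraction of the module's theoretical memory bandwidth. The sketch below assumes a peak of 204.8 GB/s, which is NVIDIA's published LPDDR5 figure for the AGX Orin 32GB module. That value was not measured in this post and is labeled as an assumption in the code; the function name is illustrative.

```python
# ASSUMPTION: published peak LPDDR5 bandwidth for the AGX Orin 32GB module.
# This figure comes from the vendor specification, not from this benchmark.
ASSUMED_PEAK_GBPS = 204.8

# Figures reported by bandwidthTest above (GB/s).
measured_gbps = {
    "host_to_device": 18.1,
    "device_to_host": 26.2,
    "device_to_device": 114.5,
}


def fraction_of_peak(measured: float, peak: float) -> float:
    """Measured bandwidth as a fraction of the assumed theoretical peak."""
    if peak <= 0:
        raise ValueError("peak bandwidth must be positive")
    return measured / peak


for name, gbps in measured_gbps.items():
    pct = fraction_of_peak(gbps, ASSUMED_PEAK_GBPS) * 100
    print(f"{name}: {gbps} GB/s ({pct:.1f}% of assumed peak)")
```

Under that assumption the device-to-device copy reaches about 56% of peak. Because the GPU is integrated and shares DRAM with the host, the host-to-device and device-to-host figures describe copies within the same physical memory rather than transfers over an external bus.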

Disk performance test

# Sequential write test (writes a 1 GB test file)
dd if=/dev/zero of=testfile bs=1G count=1 oflag=direct

1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 5.73304 s, 187 MB/s
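
dd reports throughput in decimal megabytes (10^6 bytes), while sysbench reports MiB (2^20 bytes), so the two are not directly comparable without a conversion. As a worked check, the sketch below recomputes the dd figure in both units from the byte count and elapsed time printed above; the function name is illustrative.

```python
# Values copied from the dd output above.
BYTES_COPIED = 1073741824
ELAPSED_S = 5.73304


def throughput(bytes_copied: int, seconds: float) -> tuple:
    """Return (MB/s using 10**6 bytes, MiB/s using 2**20 bytes)."""
    bytes_per_s = bytes_copied / seconds
    return bytes_per_s / 1e6, bytes_per_s / 2**20


mb_s, mib_s = throughput(BYTES_COPIED, ELAPSED_S)
print(f"{mb_s:.1f} MB/s = {mib_s:.1f} MiB/s")
```

The same write therefore reads as about 187 MB/s or about 179 MiB/s, depending on the unit.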


# Random read/write test (requires sysbench)
sudo apt-get install -y sysbench

sysbench fileio --file-total-size=2G prepare

2147483648 bytes written in 17.01 seconds (120.38 MiB/sec).

sudo sysbench fileio --file-test-mode=rndrw run
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Extra file open flags: (none)
128 files, 16MiB each
2GiB total file size
Block size 16KiB
Number of IO requests: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Initializing worker threads...

Threads started!


File operations:
    reads/s:                      1211.30
    writes/s:                     807.53
    fsyncs/s:                     2592.01

Throughput:
    read, MiB/s:                  18.93
    written, MiB/s:               12.62

General statistics:
    total time:                          10.0047s
    total number of events:              46007

Latency (ms):
         min:                                    0.00
         avg:                                    0.22
         max:                                   76.12
         95th percentile:                        0.35
         sum:                                 9979.21

Threads fairness:
    events (avg/stddev):           46007.0000/0.00
    execution time (avg/stddev):   9.9792/0.00
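
The figures in this report are internally consistent, which can be confirmed from the numbers themselves: throughput is operations per second times the 16 KiB block size, the read/write ratio should match the configured 1.50, and average latency is the latency sum divided by the event count. The sketch below runs those checks on values copied from the output above; the function name is illustrative.

```python
# Values copied from the sysbench fileio output above.
BLOCK_SIZE_KIB = 16
READS_PER_S = 1211.30
WRITES_PER_S = 807.53
LATENCY_SUM_MS = 9979.21
TOTAL_EVENTS = 46007


def ops_to_mib_per_s(ops_per_s: float, block_kib: int) -> float:
    """Convert operations per second at a fixed block size into MiB/s."""
    return ops_per_s * block_kib / 1024


print(round(ops_to_mib_per_s(READS_PER_S, BLOCK_SIZE_KIB), 2))   # read MiB/s
print(round(ops_to_mib_per_s(WRITES_PER_S, BLOCK_SIZE_KIB), 2))  # write MiB/s
print(round(READS_PER_S / WRITES_PER_S, 2))                      # read/write ratio
print(round(LATENCY_SUM_MS / TOTAL_EVENTS, 2))                   # average latency, ms
```

These reproduce the reported 18.93 and 12.62 MiB/s, the 1.50 ratio, and the 0.22 ms average latency. The low random throughput relative to the sequential result is expected with 16 KiB blocks and an fsync() every 100 requests.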