Triton Inference Server

GitHub repository
Installing Model Analyzer
YOLOv4 performance analysis example
Chinese-language blog introduction
A classic explanation of server latency, concurrency, concurrency level, and throughput
Client Python examples
Tools for model repository management and performance testing
1. Performance monitoring and optimization
The Model Analyzer helps you understand a model's GPU memory usage, so you can decide how to run multiple models on a single GPU.
It reports measurements such as: Concurrency: 1, throughput: 62.6 infer/sec, latency 21371 usec
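As a sketch of how such a measurement is produced (the repository path and model name below are placeholders, not from the original post), a Model Analyzer profiling run looks roughly like:

```shell
# Profile a model's GPU memory and compute usage with Model Analyzer.
# /path/to/model_repository and resnet50 are placeholder values.
model-analyzer profile \
    --model-repository /path/to/model_repository \
    --profile-models resnet50
```

Model Analyzer sweeps configurations and reports per-configuration throughput and latency like the numbers quoted above.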
2. Enable the dynamic batcher
The dynamic batcher merges concurrent requests into batches before running inference.
Stop Triton, add dynamic_batching { } to the model's configuration file, then restart Triton.
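A minimal config.pbtxt fragment with the dynamic batcher enabled. The empty `dynamic_batching { }` from the step above is sufficient; the two fields shown here are optional tuning knobs, with example values:

```protobuf
# config.pbtxt (fragment) - enables dynamic batching for this model.
dynamic_batching {
  # Batch sizes the scheduler should prefer to build (optional).
  preferred_batch_size: [ 4, 8 ]
  # Max time a request may wait in the queue to form a larger batch (optional).
  max_queue_delay_microseconds: 100
}
```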
3. In general the benefit of the dynamic batcher and multiple instances is model specific, so you should experiment with perf_analyzer to determine the settings that best satisfy your throughput and latency requirements.
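For the multiple-instances side of that experiment, an instance_group entry in config.pbtxt controls how many copies of the model run concurrently (the count of 2 is just an example value):

```protobuf
# config.pbtxt (fragment) - run 2 execution instances of the model per GPU.
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
```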
4. perf_analyzer -m inception_graphdef --concurrency-range 1:4 -f perf.csv
This writes the measurement data to a CSV file.
5. model_analyzer in detail
6. It can generate charts such as throughput and latency curves.
7. Deploying on Kubernetes (k8s)
8. Performance Analyzer: the performance measurement tool.
Model Analyzer: uses the Performance Analyzer to measure a model's GPU memory and compute utilization.
By default perf_analyzer sends input tensor data and receives output tensor data over the network. You can instead instruct perf_analyzer to use system shared memory or CUDA shared memory to communicate tensor data. By using these options you can model the performance that you can achieve by using shared memory in your application. Use `--shared-memory=system` to use system (CPU) shared memory or `--shared-memory=cuda` to use CUDA shared memory.
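For example, repeating the earlier measurement over CUDA shared memory (model name reused from the command in item 4):

```shell
# Transfer input/output tensors via CUDA shared memory instead of the network.
perf_analyzer -m inception_graphdef \
    --concurrency-range 1:4 \
    --shared-memory=cuda
```

Comparing this against the default network run shows how much of the measured latency is data transfer.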
